arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1855
2510.08564 2026-04-22 cs.AI cs.CV cs.LG

How to Teach Large Multimodal Models New Skills

Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem

Comments In submission. Code is available at https://github.com/jessemelpolio/LMM_CL

详情
英文摘要

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. Surprisingly, we find that performance lost on held-out tasks after fine-tuning on one skill can partly recover when the model is subsequently tuned on a different skill. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that shows the shift co-varies with forgetting. Guided by this insight, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers (SA Proj., $Δ$ learning +24.9 / $Δ$ held-out forgetting -0.6), and (ii) updating only the MLP Gate&Up while freezing the Down projection (+30.5 / -2.1). Both substantially outperform full-LLM tuning (+31.8 / -23.3) in the learning-forgetting trade-off. We also compare against common forgetting mitigation methods: Learning without Forgetting (LwF), LoRA, Mixture-of-Experts, and weight-space interpolation (WiSE-FT), and find that our selective tuning recipes match or exceed their learning-stability balance while remaining simpler, requiring no replay, auxiliary parameters, or per-stage tuning. These results hold across LLaVA-OneVision, LLaVA-NeXT, and Qwen2.5-VL, confirming that the key to teaching LMMs new skills without forgetting lies in controlling output distribution shift by choosing which components to tune. Code will be made available.

2510.08240 2026-04-22 cs.CL

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan

Comments ICLR 2026

详情
英文摘要

Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.

2510.08145 2026-04-22 cs.CL

Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling

Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu

详情
英文摘要

Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.

2510.07037 2026-04-22 cs.CL

Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh

详情
英文摘要

Amidst the rapid advances of large language models (LLMs), most LLMs still struggle with mixed-language inputs, limited Codeswitching (CSW) datasets, and evaluation biases, which hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. We categorize recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and identifying the challenges that persist. The paper concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities https://github.com/lingo-iitgn/awesome-code-mixing/.

2510.06860 2026-04-22 cs.LG cs.AI

Towards Generalization of Graph Neural Networks for AC Optimal Power Flow

Olayiwola Arowolo, Jochen L. Cremer

Comments Pre-print has been submitted for review

详情
英文摘要

AC Optimal Power Flow (ACOPF) is computationally intensive for large-scale grids, often requiring prohibitive solution times with conventional solvers. Machine learning offers significant speedups, but existing models struggle with scalability and topology flexibility. To address these challenges, we propose a Hybrid Heterogeneous Message Passing Neural Network (HH-MPNN) that integrates a heterogeneous graph neural network (GNN) with a scalable transformer and physics-informed positional encodings. Our architecture explicitly models distinct power system components to capture local features while using global attention for long-range dependencies. Evaluated on diverse benchmarks, including PGLearn and GridFM-DataKit datasets, HH-MPNN achieves less than 1% optimality gap on default topologies across grid sizes from 14 to 2,000 buses. For N-1 contingencies, our approach demonstrates zero-shot N-1 generalization with less than 3% optimality gap on several test cases despite training only on default topologies. We further develop an approach that ensures robust N-1 generalization to high-impact contingencies through targeted augmentation of the training data, showing that exhaustive simulation is unnecessary for topologically flexible models. Finally, size generalization experiments demonstrate that pre-training on small grids significantly improves performance on large-scale systems. Achieving computational speedups of up to 5,000 times compared to interior point solvers, these results advance practical, generalizable machine learning for real-time power system operations.

2510.05608 2026-04-22 cs.CL

A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Comments ACL 2026

详情
英文摘要

Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent's planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.

2510.05188 2026-04-22 cs.AI

Plug-and-Play Dramaturge: A Divide-and-Conquer Approach for Iterative Narrative Script Refinement via Collaborative LLM Agents

Wenda Xie, Chao Guo, Yanqing Jing, Junle Wang, Yisheng Lv, Fei-Yue Wang

详情
英文摘要

Although LLMs have been widely adopted for creative content generation, a single-pass process often struggles to produce high-quality long narratives. How to effectively revise and improve long narrative scripts like scriptwriters remains a significant challenge, as it demands a comprehensive understanding of the entire context to identify global structural issues and local detailed flaws, as well as coordinating revisions at multiple granularities and locations. Direct modifications by LLMs typically introduce inconsistencies between local edits and the overall narrative requirements. To address these issues, we propose Dramaturge, a task and feature oriented divide-and-conquer approach powered by hierarchical multiple LLM agents. It consists of a Global Review stage to grasp the overall storyline and structural issues, a Scene-level Review stage to pinpoint detailed scene and sentence flaws, and a Hierarchical Coordinated Revision stage that coordinates and integrates structural and detailed improvements throughout the script. The top-down task flow ensures that high-level strategies guide local modifications, maintaining contextual consistency. The review and revision workflow follows a coarse-to-fine iterative process, continuing through multiple rounds until no further substantive improvements can be made. Comprehensive experiments show that Dramaturge significantly outperforms all baselines in terms of script-level overall quality and scene-level details. Our approach is plug-and-play and can be easily integrated into existing methods to improve the generated scripts.

2510.04686 2026-04-22 cs.LG cs.AI

How does the optimizer implicitly bias the model merging loss landscape?

Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw

Comments Published at ICLR2026

详情
英文摘要

Model merging combines independent solutions with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which simply averages multiple model weights, and task arithmetic, which combines task vectors obtained by the difference between finetuned and base models. While useful in practice, what properties make merging effective are poorly understood. This paper explores how the optimization dynamics affect the loss landscape geometry and its impact on merging success. We show that a single quantity -- the effective noise scale -- unifies the impact of different optimizer components on model merging. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work connecting optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be successfully merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its consequences for model merging, suggesting that training dynamics could be further manipulated to improve model merging.

2509.26238 2026-04-22 cs.LG

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez

Comments ICLR 2026; Minor revisions and clarifications

详情
英文摘要

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On WildGuardMix, across 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size for harmful prompt classification, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.

2509.21080 2026-04-22 cs.CL cs.AI cs.CY

InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation

Yixin Wan, Xingrun Chen, Kai-Wei Chang

详情
英文摘要

Advancements in Large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation. However, recent research raised concerns about culture-related fairness issues in LLM-generated content. In this work, we identify and systematically investigate LLMs' insider-outsider bias, a phenomenon where models position themselves as "insiders" of mainstream cultures during generation while externalizing less dominant cultures. We propose the InsideOut benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a culturally situated interview script generation task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals that while models adopt insider tones in over 88% US-contexted scripts on average, they disproportionately default to "outsider" stances for non-Western cultures. To mitigate these biases, we propose 2 inference-time methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of a Single-Agent (MFA-SA), a Hierarchical-Agent (MFA-HA), and an autonomous Agentic Planning (MFA-Plan) pipeline. Empirical results demonstrate that agent-based MFA methods achieve outstanding and robust performance in mitigating the insider-outsider bias: For instance, on the Cultural Alignment Gap (CAG) metric, MFA-SA reduces bias in Llama model by 89.70 % and MFA-HA mitigates bias in Qwen by 82.54%. These findings showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.

2509.21020 2026-04-22 cs.RO

Hybrid Task and Motion Planning with Reactive Collision Handling for Multi-Robot Disassembly of Complex Products: Application to EV Batteries

Abdelaziz Shaarawy, Cansu Erdogan, Rustam Stolkin, Alireza Rastegarpanah

详情
英文摘要

This paper addresses the problem of multi-robot coordination for complex manipulation task sequences. We present a vision-driven task-and-motion planning (TAMP) framework for a real dual-agent platform that integrates task decomposition and allocation with a learning-based RRT planner. A GMM-informed motion planner is coupled with a hybrid safety layer that combines predictive collision checking in a MoveIt/FCL digital twin with reactive vision-based avoidance and replanning. This integration is challenging as the system jointly satisfies task precedence, geometric feasibility, dynamic obstacle avoidance, and dual-arm coordination constraints. The framework operates in closed loop by updating the remaining task sequence from repeated scene scans and completion-state tracking rather than executing a fixed open-loop plan. In EV battery disassembly experiments, compared with Default-RRTConnect under identical perception and task assignments, the proposed system reduces cumulative end-effector path length from 48.8 to 17.9~m ($-63.3\%$), improves makespan from 467.9 to 429.8~s ($-8.1\%$), and reduces swept volumes (R1: $0.583\rightarrow0.139\,\mathrm{m}^3$, R2: $0.696\rightarrow0.252\,\mathrm{m}^3$) and overlap ($0.064\rightarrow0.034\,\mathrm{m}^3$). These results show that combining predictive planning and reactive collision avoidance in a real dual-arm disassembly scenario improves motion compactness, safety, and scalability to broader multi-robot sequential manipulation tasks.

2509.16343 2026-04-22 cs.CV cs.AI cs.MA

Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling

Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

Comments Accepted to MORS 2026 Artificial Intelligence Workshop Proceedings

详情
英文摘要

Building robust vision systems for high-stakes domains such as remote sensing requires stronger visual reasoning than what single-pass inference typically provides; yet, retraining large models is often computationally expensive and data intensive. We present Visual Reasoning Agent (VRA), a training-free agentic visual reasoning framework that orchestrates off-the-shelf large vision-language models (LVLMs) with a large reasoning model (LRM) through an iterative Think-Critique-Act loop for cross-model verification, self-critique, and recursive refinement. On the remote sensing benchmark VRSBench VQA dataset, VRA consistently outperforms multiple standalone LVLM baselines and achieves up to 40.67\% improvement on challenging question types spanning both perception and reasoning tasks. In addition, integrating three LVLMs with VRA improves the overall accuracy of the standalone LVLMs from 52.8% to 78.8%, demonstrating the effectiveness of agentic reasoning with increased inference-time compute.

2509.13281 2026-04-22 cs.AI cs.CL

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang

Comments ICLR 2026

详情
英文摘要

Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations. While existing steering methods already achieve high attack success rates through broad interventions, RepIt enables a more concerning capability: selective suppression of refusal on targeted concepts while preserving refusal elsewhere. Across five frontier LMs, RepIt produces evaluation-evading model organisms with semantic backdoors, answering questions related to weapons of mass destruction while still scoring as safe on standard benchmarks. We find the edit of the steering vector localizes to just 100-200 residual dimensions, and robust concept vectors can be extracted from as few as a dozen examples on a single RTX A6000, highlighting how targeted, hard-to-detect modifications can exploit evaluation blind spots with minimal resources. Through demonstrating precise concept disentanglement, this work exposes vulnerabilities in current safety evaluation practices and demonstrates a need for more comprehensive, representation aware assessments.

2509.12516 2026-04-22 cs.RO

Zero to Autonomy in Real-Time: Online Adaptation of Dynamics in Unstructured Environments

William Ward, Sarah Etter, Jesse Quattrociocchi, Christian Ellis, Adam J. Thorpe, Ufuk Topcu

Comments Initial submission to RA-L

详情
英文摘要

Autonomous robots must go from zero prior knowledge to safe control within seconds to operate in unstructured environments. Abrupt terrain changes, such as a sudden transition to ice, create dynamics shifts that can destabilize planners unless the model adapts in real-time. We present a method for online adaptation that combines function encoders with recursive least squares, treating the function encoder coefficients as latent states updated from streaming odometry. This yields constant-time coefficient estimation without gradient-based inner-loop updates, enabling adaptation from only a few seconds of data. We evaluate our approach on a Van der Pol system to highlight algorithmic behavior, in a Unity simulator for high-fidelity off-road navigation, and on a Clearpath Jackal robot, including on a challenging terrain at a local ice rink. Across these settings, our method improves model accuracy and downstream planning, reducing collisions compared to static and meta-learning baselines.

2509.12483 2026-04-22 cs.LG

Benchmarking Physics-Informed Neural Networks and Boundary Elements Methods for Wave Scattering

Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata

Comments 17 pages, 4 figures

详情
英文摘要

This study compares the Boundary Element Method (BEM) and Physics-Informed Neural Networks (PINNs) for solving the two-dimensional Helmholtz equation in wave scattering problems. The objective is to evaluate the performance of both methods under the same conditions. We solve the Helmholtz equation using BEM and PINNs for the same scattering problem. PINNs are trained by minimizing the residual of the governing equations and boundary conditions with their configuration determined through hyperparameter optimization, while BEM is applied using boundary discretization. Both methods are evaluated in terms of solution accuracy and computation time. We conducted numerical experiments by varying the number of boundary integration points for the BEM and the number of hidden layers and neurons per layer for the PINNs. We performed a hyperparameter tuning to identify an adequate PINN configuration for this problem as a network with 3 hidden layers and 25 neurons per layer, using a learning rate of $10^{-2}$ and a sine activation function. At comparable levels of accuracy, the assembly and solution of the BEM system required a computational time on the order of $10^{-2}$~s, whereas the training time of the PINN was on the order of $10^{2}$~s, corresponding to a difference of approximately four orders of magnitude. However, once trained, the PINN achieved evaluation times on the order of $10^{-2}$~s, which is about two orders of magnitude faster than the evaluation of the BEM solution at interior points. This work establishes a procedure for comparing BEM and PINNs. It also presents a direct comparison between the two methods for the scattering problem. The analysis provides quantitative data on their performance, supporting their use in future research on wave propagation problems and outlining challenges and directions for further investigation.

2509.12052 2026-04-22 cs.CV

FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling

Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han

详情
英文摘要

Current talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. An important reason is that denoising trajectory variation induced by stochastic initialization leaves residual inter-frame inconsistencies, which manifest as short-term, abrupt visual fluctuations between adjacent frames. To further verify this, we conduct a controlled study by fixing the input while varying only the random seed. The results show markedly different flicker patterns across samplings, with a mean inter-seed Pearson correlation of only r = 0.15. This motivates us to explore autoregressive generation, which models frames sequentially and provides a more direct prior for temporal continuity. Based on this, we propose FluentAvatar, a two-stage autoregressive framework built on phoneme representations. First, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask, and Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built upon selective state space modeling. Moreover, we introduce BG-Flicker, a background-isolated metric for talking-head videos that enables more reliable evaluation of inter-frame flicker. Experiments on CMLR and HDTF demonstrate that FluentAvatar achieves strong performance in visual fidelity, lip synchronization, and temporal stability, attaining the best FVD on both datasets and BG-Flicker results close to ground truth. The code, the model, and the interface will be released to facilitate further research.

2509.11253 2026-04-22 cs.AI

VideoAgent: Personalized Synthesis of Scientific Videos

Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, Quan Wang

详情
英文摘要

The technical complexity of research papers often limits their reach, necessitating more accessible formats like scientific videos to disseminate key insights through engaging narration. However, existing automated methods primarily focus on static posters or slide presentations that remain template-bound and linear. Shifting to audience-adaptive video synthesis requires addressing non-linear narrative orchestration and the joint synchronization of disparate multimodal assets. We introduce VideoAgent, a modular framework that redefines scientific video synthesis as an intent-driven planning problem. By decoupling content understanding from multimodal synthesis, VideoAgent adaptively interleaves static slides with dynamic animations to match the semantic density of the narration. We further propose SciVidEval, a benchmark evaluating multimodal quality and pedagogical utility through automated metrics and human knowledge transfer studies. Extensive experiments demonstrate that VideoAgent effectively conveys complex technical logic with high narrative fidelity and communicative impact.

2508.21184 2026-04-22 cs.CL cs.AI stat.ML

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth

Comments Published at the International Conference on Learning Representations 2026

详情
英文摘要

We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) with respect to a variable of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM's predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to purely prompting-based design generation and other adaptive design strategies.

2508.15832 2026-04-22 cs.CL cs.AI

A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans

Comments 8 pages for main body and 8 pages of appendix

详情
英文摘要

Web agents have shown great promise in performing many tasks on ecommerce website. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

2508.13905 2026-04-22 cs.LG

Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management

Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann, Gregor Schiele

Comments 6 pages, 6 figures, 1 table, accepted by the 11th IEEE International Smart Cities Conference (ISC2)

详情
英文摘要

Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping mitigating uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (https://github.com/tianheng-ling/EdgeOverflowForecast).

2508.13654 2026-04-22 cs.LG cs.AI cs.CL

Input-Time Scaling: Adding Noise and Irrelevance into Less-Is-More Drastically Improves Reasoning Performance and Efficiency

Rapheal Huang, Weilong Guo

详情
英文摘要

Large Language Models (LLMs) excel at reasoning, traditionally requiring high-quality large-scale data and extensive training. Recent works reveal a very appealing Less-Is-More phenomenon where very small, carefully curated high-quality datasets match resource-intensive approaches. In this work, we further systematically relax their quality constraints by adding controlled noise via persona context relevance and comparing datasets of different qualities. Counterintuitively, we find that mixing relevant and irrelevant contexts consistently across training and inference stages yields optimal results -- a phenomenon we term training-testing co-design. Dataset quality comparisons show that high-quality data benefits weaker models on easy questions, while low-quality data achieves higher scores on hard questions with capable models. Across our experiments, reasoning performance is linked to reasoning efficiency. We, for the first time, found adding noisy and irrelevant contexts into queries can improve reasoning efficiency without any prices and targeted designs. Building on these insights, we propose Input-Time Scaling: applying small, low-quality data to capable models with training-testing co-design. This maintains Less-Is-More while further removing labor-intensive quality curation and improving reasoning effectiveness and efficiency, making the approach more applicable and affordable. Our method achieves 76.7% pass@1 on AIME24/25 using Qwen2.5-32B-Instruct, and 90.0%/80.0% with DeepSeek-R1-Distill-Qwen-32B -- state-of-the-art among Qwen2.5-32B variants. We are open-sourcing our datasets, pipelines, evaluation results, and checkpoints to facilitate reproducibility and further research.

2508.05498 2026-04-22 cs.AI

GRAIL:Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning

Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu

Comments This is a duplicate submission of the article "Enhancing Agentic Textual Graph Retrieval with Synthetic Stepwise Supervision" [arXiv:2510.03323]. The content is identical to the newer version. This entry is being withdrawn to avoid redundancy

详情
英文摘要

Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets is available at https://github.com/Changgeww/GRAIL.

2508.00161 2026-04-22 cs.LG cs.CL

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong, Aditi Raghunathan

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including mathematical problem solving, emoji usage, and Midjourney prompt generation.

2507.21526 2026-04-22 cs.CL

Accelerating Prefilling via Decoding-time Contribution Sparsity

Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

详情
英文摘要

Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone. Our code is released at https://aka.ms/TriangleMix.

2507.11889 2026-04-22 cs.RO

NemeSys: Toward Online Underwater Exploration with Remote Operator-in-the-loop Adaptive Autonomy

Adnan Abdullah, Alankrit Gupta, Vaishnav Ramesh, Shivali Patel, Md Jahidul Islam

Comments 10 pages, V2

详情
英文摘要

Adaptive mission control and dynamic parameter reconfiguration are essential for autonomous underwater vehicles (AUVs) operating in GPS-denied, communication-limited marine environments. However, AUV platforms generally execute static, pre-programmed missions or rely on tethered connections and high-latency acoustic channels for mid-mission updates, significantly limiting their adaptability and responsiveness. In this paper, we introduce NemeSys, a novel AUV system designed to support real-time mission reconfiguration through compact magnetoelectric (ME) signaling. We present the full system design, control architecture, and a mission encoding framework that enables interactive exploration and task adaptation via low-bandwidth communication. The proposed system is validated through analytical modeling, controlled simulation tests, and real-world trials. The mid-mission retasking scenarios, evaluated using the NemeSys digital twin, demonstrate behavior switching latency below 50 ms with only a 13.2 MB peak computational overhead, making the framework suitable for deployment on edge computing hardware. Laboratory tank tests and open-water field trials further confirm stable control and reliable mission execution in dynamic underwater environments. These results establish the feasibility of online mission reconfiguration and highlight NemeSys as a promising step toward responsive, goal-driven adaptive underwater autonomy.

2507.06321 2026-04-22 cs.CV cs.LG

Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation

Joon Tai Kim, Tianle Chen, Ziyu Dong, Nishanth Kunchala, Alexander Guller, Daniel Ospina Acero, Roger Williams, Mrinal Kumar

Comments 16 pages, 5 figures; published in AIAA SciTech Forum 2026, Paper 2026-1763

详情
Journal ref
AIAA SciTech Forum, 2026, 2026-1763
英文摘要

Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy-Paste Data Augmentation (CCPDA) method, for the purpose of assisting with the training of deep-learning multiclass segmentation models, with special focus on improving segmentation outcomes for the fire-class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum-based multi-objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries significantly more operational significance than other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also illustrates that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire-class segmentation performance.

2507.03828 2026-04-22 cs.LG stat.ML

IMPACT: Importance-Aware Activation Space Reconstruction

Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li

Comments To appear in the Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
英文摘要

Large language models (LLMs) achieve strong performance across diverse domains but remain difficult to deploy in resource-constrained environments due to their size. Low-rank compression is a common remedy, typically minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. In contrast, LLM activations exhibit a more pronounced low-rank structure, motivating approaches that minimize activation reconstruction error. This shift alone, however, is not sufficient: different activation dimensions contribute unequally to model performance, and treating them uniformly can lead to accuracy loss. We introduce IMPACT, an importance-aware activation reconstruction framework that links compression to its effect on model performance. IMPACT formulates compression as an optimization problem that integrates activation structure with gradient-based importance, deriving a closed-form solution where reconstruction bases arise from an importance-weighted activation covariance matrix. This yields low-rank compression explicitly optimized for accuracy preservation. Experiments across multiple models and tasks demonstrate that IMPACT achieves up to 55.4% greater model size reduction while maintaining accuracy comparable to or better than state-of-the-art baselines.

2507.00451 2026-04-22 cs.LG cs.AI cs.DS cs.IT math.IT stat.ML

Best Agent Identification for General Game Playing

Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers

详情
英文摘要

We present an efficient and generalised procedure to accurately identify the best (or near best) performing algorithm for each sub-task in a multi-problem domain. Our approach treats this as a set of best arm identification problems for multi-armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on a chosen confidence interval, that ranks each arm across all bandits in terms of their potential to influence our overall simple regret. We evaluate the performance of our approach on two of the most popular general game playing domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of selecting a high-performing agent for each game using a limited number of available trials. Compared to previous best arm identification algorithms for multi-armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret and average probability of error. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi-task domains with high algorithm runtimes.

2507.00439 2026-04-22 cs.CL

Improving the Distributional Alignment of LLMs using Supervision

Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease

Comments ACL Main 2026

详情
英文摘要

The ability to accurately align LLMs with diverse population groups on subjective questions would have great value. In this work, we show that adding simple supervision can more consistently improve the alignment of LLM-generated distributions with diverse population groups, as measured across three datasets spanning public health, public opinion, and values and beliefs. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLM generations with diverse populations. By conducting evaluation over many LLMs and prompting strategies, we provide a benchmark to stimulate future research.

2506.18186 2026-04-22 cs.LG stat.ML

Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels

Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, Christopher G. Brinton

详情
英文摘要

The restless multi-armed bandit (RMAB) framework is a popular approach to solving resource allocation problems in networked systems. In this paper, we study optimal resource allocation in RMABs facing unknown and non-stationary dynamics. Solving RMABs optimally is known to be PSPACE-hard even with full knowledge of model parameters. While Whittle index policies offer asymptotic optimality with low computational cost, they require access to stationary transition kernels, an unrealistic assumption in many modern networking applications. To address this challenge, we propose a Sliding-Window Online Whittle (SW-Whittle) policy that remains computationally efficient while adapting to time-varying kernels. Through theoretical analysis, we show that our algorithm achieves sub-linear dynamic regret with respect to the number of episodes. We further address the important case where the variation budget is unknown in advance by combining a Bandit-over-Bandit framework with our sliding-window design. In our scheme, window lengths are tuned online as a function of the estimated variation, while Whittle indices are computed via an upper-confidence-bound of the estimated transition kernels and a bilinear optimization routine. Numerical experiments demonstrate that our algorithm consistently outperforms baselines, achieving the lowest cumulative regret across a range of non-stationary environments.