arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1107
2510.25426 2026-05-05 cs.CL cs.AI

Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

Asutosh Hota, Jussi P. P. Jokinen

Comments The manuscript is approximately 7360 words and contains 12 figures and 6 tables

详情
英文摘要

The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.

2510.24814 2026-05-05 cs.CV cs.AI

Deep Feature Optimization for Enhanced Fish Freshness Assessment

Phi-Hung Hoang, Nam-Thuan Trinh, Van-Manh Tran, Thi-Thu-Hong Phan

Comments 39 pages; 10 tables; 9 figures

详情
Journal ref
Ecological Informatics, 95, 103711, 2026
英文摘要

Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. However, traditional sensory evaluation remains subjective, time-consuming, and inconsistent. Although recent advances in deep learning have automated visual freshness prediction, challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment. First, five state-of-the-art vision architectures - ResNet-50, DenseNet-121, EfficientNet-B0, ConvNeXt-Base, and Swin-Tiny - are fine-tuned to establish a strong baseline. Next, multi-level deep features extracted from these backbones are used to train seven classical machine learning classifiers, integrating deep and traditional decision mechanisms. Finally, feature selection methods based on Light Gradient Boosting Machine (LGBM), Random Forest, and Lasso identify a compact and informative subset of features. Experiments on the Freshness of the Fish Eyes (FFE) dataset demonstrate that the best configuration combining Swin-Tiny features, an Extra Trees classifier, and LGBM-based feature selection achieves an accuracy of 85.99%, outperforming recent studies on the same dataset by 8.69-22.78%. These findings confirm the effectiveness and generalizability of the proposed framework for visual quality evaluation tasks.

2510.22907 2026-05-05 cs.CL cs.AI cs.PL cs.SE

Reinforcement Learning from Compiler and Language Server Feedback

Yifan Zhang, Lanser Contributors

Comments Project Page: https://github.com/yifanzhang-pro/lanser-cli

详情
英文摘要

Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers already compute the missing supervision signal, in the form of diagnostics, symbol resolution, type information, references, and refactoring preconditions, but expose it through interfaces designed for human-driven IDEs rather than learning loops. We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI, in turn, converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots. We formalize determinism for canonical bundles and prove that componentwise-improving transitions receive non-negative reward in the undiscounted setting. Together, these pieces yield a practical substrate for process supervision of coding agents.

2510.21833 2026-05-05 cs.CV

Towards Accurate and Efficient Waste Image Classification: A Hybrid Deep Learning and Machine Learning Approach

Ngoc-Bao-Quang Nguyen, Tuan-Minh Do, Cong-Tam Phan, Thi-Thu-Hong Phan

Comments 31 pages; 7 figures; 16 tables

详情
Journal ref
Ain Shams Engineering Journal, 17(4), 104062, 2026
英文摘要

Automated image-based garbage classification is a critical component of global waste management; however, systematic benchmarks that integrate Machine Learning (ML), Deep Learning (DL), and efficient hybrid solutions remain underdeveloped. This study provides a comprehensive comparison of three paradigms: (1) machine learning algorithms using handcrafted features, (2) deep learning architectures, including ResNet variants and EfficientNetV2S, and (3) a hybrid approach that utilizes deep models for feature extraction combined with classical classifiers such as Support Vector Machine and Logistic Regression to identify the most effective strategy. Experiments on three public datasets - TrashNet, Garbage Classification, and a refined Household Garbage Dataset (with 43 corrected mislabels)- demonstrate that the hybrid method consistently outperforms the others, achieving up to 100% accuracy on TrashNet and the refined Household set, and 99.87% on Garbage Classification, thereby surpassing state-of-the-art benchmarks. Furthermore, feature selection reduces feature dimensionality by over 95% without compromising accuracy, resulting in faster training and inference. This work establishes more reliable benchmarks for waste classification and introduces an efficient hybrid framework that achieves high accuracy while reducing inference cost, making it suitable for scalable deployment in resource-constrained environments.

2510.18232 2026-05-05 cs.LG cs.CR

ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control

Yuzheng Hu, Ryan McKenna, Da Yu, Shanshan Wu, Han Zhao, Zheng Xu, Peter Kairouz

Comments Accepted at ICML 2026

详情
英文摘要

Generating high-quality synthetic text under differential privacy (DP) is critical for training and evaluating language models without compromising user privacy. Prior work on synthesizing DP datasets often fail to preserve key statistical attributes, suffer utility loss from the noise required by DP, and lack fine-grained control over generation. To address these challenges, we make two contributions. First, we introduce a hierarchical framework that decomposes DP synthetic text generation into two subtasks: feature learning and conditional text generation. This design explicitly incorporates learned features into the generation process and simplifies the end-to-end synthesis task. Through systematic ablations, we identify the most effective configuration: a rich tabular schema as feature, a DP tabular synthesizer, and a DP fine-tuned conditional generator, which we term ACTG (Attribute-Conditioned Text Generation). Second, we propose Anchored RL (ARL), a post-training method that improves the instruction-following ability of ACTG for conditional generation. ARL combines RL to boost control with an SFT anchor on best-of-$N$ data to prevent reward hacking. Together, these components form our end-to-end algorithm ACTG-ARL, which advances both the quality of DP synthetic text (+20% MAUVE over prior work) and the control of the conditional generator under strong privacy guarantees.

2510.11144 2026-05-05 cs.AI cs.CL

$How^{2}$: How to learn from procedural How-to questions

Gautier Dagan, Frank Keller, Alex Lascarides

详情
英文摘要

An agent facing a planning problem can use answers to how-to questions to reduce uncertainty and fill knowledge gaps, helping it solve both current and future tasks. However, their open ended nature, where valid answers to "How do I X?" range from executable actions to high-level descriptions of X's sub-goals, makes them challenging for AI agents to ask, and for AI experts to answer, in ways that support efficient planning. We introduce $How^{2}$, a memory agent framework that enables agents to ask how-to questions, store the answers, and reuse them for lifelong learning in interactive environments. We evaluate our approach in Plancraft, a Minecraft crafting environment, where agents must complete an assembly task by manipulating inventory items. Using teacher models that answer at varying levels of abstraction, from executable action sequences to high-level subgoal descriptions, we show that lifelong learning agents benefit most from answers that are abstracted and decoupled from the current state. $How^{2}$ offers a way for LLM-based agents to improve their planning capabilities over time by asking questions in interactive environments.

2510.10785 2026-05-05 cs.SD

FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, Volodymyr Kindratenko

Comments 5 pages, 2 figures. Accepted to ICASSP 2026

详情
Journal ref
ICASSP 2026, Barcelona, Spain, 2026, pp. 17052-17056
英文摘要

Previous accent conversion (AC) methods, including foreign accent conversion (FAC), lack explicit control over the degree of modification. Because accent modification can alter the perceived speaker identity, balancing conversion strength and identity preservation is crucial. We present an AC framework that provides an explicit, user-controllable parameter to adjust the strength of pronunciation-level accent modification. Results show performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.

2510.09883 2026-05-05 cs.CL cs.LG

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

Comments Accepted to Findings of ACL 2026

详情
英文摘要

Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to the entire growing sequence. One approach to reduce this latency is to evict entries from the key-value (KV) cache, thereby reducing the active context used in attention computation. However, such sparse attention methods suffer from severe accuracy degradation on reasoning tasks due to cumulative selection errors and the evolving importance of tokens over long derivations. We present \textbf{DELTA}, a training-free sparse attention mechanism that improves computational efficiency without sacrificing model accuracy. DELTA partitions transformer layers into three groups: initial layers that use full attention, a small set of \emph{$Δ$-layers} that identify salient tokens via aggregated head-level attention scores, and subsequent sparse-attention layers that attend only to the selected subset. This design preserves the full KV cache in GPU memory for accuracy, while avoiding expensive full-attention computation over many layers. On reasoning benchmarks such as AIME and GPQA-Diamond, DELTA matches or surpasses full attention in accuracy, while reducing the number of attended tokens by up to $4.25\times$ and delivering $1.54\times$ end-to-end speedup. Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning. The code is available at https://github.com/hoenza/DELTA.

2510.07731 2026-05-05 cs.AI cs.CL

oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji

Comments We need adjust some authorship

详情
英文摘要

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

2510.03599 2026-05-05 cs.RO

Learning to Act Through Contact: A Unified View of Multi-Task Robot Learning

Shafeef Omar, Majid Khadiv

详情
英文摘要

We present a unified framework for multi-task locomotion and manipulation policy learning grounded in a contact-explicit representation. Instead of designing different policies for different tasks, our approach unifies the definition of a task through a sequence of contact goals--desired contact positions, timings, and active end-effectors. This enables leveraging the shared structure across diverse contact-rich tasks, leading to a single policy that can perform a wide range of tasks. In particular, we train a goal-conditioned reinforcement learning (RL) policy to realise given contact plans. We validate our framework on multiple robotic embodiments and tasks: a quadruped performing multiple gaits, a humanoid performing multiple biped and quadrupedal gaits, and a humanoid executing different bimanual object manipulation tasks. Each of these scenarios is controlled by a single policy trained to execute different tasks grounded in contacts, demonstrating versatile and robust behaviours across morphologically distinct systems. Our results show that explicit contact reasoning significantly improves generalisation to unseen scenarios, positioning contact-explicit policy learning as a promising foundation for scalable loco-manipulation. Video available at: https://youtu.be/idHx67oHHU0?si=qZJ7C0ujemXNWgA5

2510.02945 2026-05-05 cs.LG cs.AI

Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning

Juan Sebastian Rojas, Chi-Guhn Lee

详情
英文摘要

Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance between retaining useful information and adapting to new situations. To date, continual RL has been explored almost exclusively through the lens of risk-neutral decision-making, in which the agent aims to optimize the expected long-run performance. In this work, we present the first formal theoretical treatment of continual RL through the lens of risk-aware decision-making, in which the behaviour of the agent is directed towards optimizing a measure of long-run performance beyond the mean. In particular, we show that the classical theory of risk measures, widely used as a theoretical foundation in non-continual risk-aware RL, is, in its current form, incompatible with continual learning. Then, building on this insight, we extend risk measure theory into the continual setting by introducing a new class of ergodic risk measures, and showing that it is compatible with continual learning. Finally, we provide a case study of continual risk-aware learning, along with empirical results, which show the intuitive appeal of ergodic risk measures in continual settings.

2510.01020 2026-05-05 cs.LG cs.AI math.ST stat.ML stat.TH

The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification

Tavor Z. Baharav, Spyros Dragazis, Aldo Pacchiano

Comments 38 pages, accepted to AISTATS 2026

详情
英文摘要

We study sequential testing for a binary disease outcome when risk follows an unknown logistic model. At each round, the decision maker may either pay for a test revealing the true label or predict the outcome based on patient features and past data. The goal is to minimize costly tests while ensuring the misclassification rate stays below $α$ with probability at least $1-δ$. We propose a method that jointly estimates the logistic parameter $θ^{\star}$ and the feature distribution, using a conservative threshold on the logistic score to decide when to test. We prove our procedure achieves the target error with high probability and requires only $\widetilde O(\sqrt{T})$ more tests than an oracle with full knowledge. This is the first no-regret guarantee for error-constrained logistic testing, with direct applications to medical screening. Simulations corroborate our theoretical results, showing safe classification of patients and efficient estimation of $θ^{\star}$ with few excess tests.

2509.25020 2026-05-05 cs.LG

Deep Thinking by Markov Chain of Continuous Thoughts

Jiayu Liu, Zhenya Huang, Xuan Yang, Tianyun Ji, Anya Sims, Hao Xu, Enhong Chen, Yee Whye Teh, Ning Miao

详情
英文摘要

Transformer-based models can perform complicated reasoning by generating reasoning paths token by token. While effective, this approach often requires generating thousands of tokens to solve a single problem, which can be slow and computationally expensive. More importantly, it involves a discrete sampling operation at the end of each time step, creating an information bottleneck across time steps. In this work, we propose MarCos, an improvement of the transformer structure that allows fully continuous reasoning at the thought level. Unlike traditional transformer layers, which focus on refining token predictions at each time step, layers in MarCos map a continuous representation of a stepwise thought to the distribution of the next thought. This enables us to achieve multi-step reasoning in a single pass of MarCos. Preliminary experimental results on synthetic and real-world math tasks show the great potential of MarCos. Notably, we observe that the increased information bandwidth of MarCos elicits the ability of parallel thinking, in contrast to single-threaded thinking in traditional transformers. Meanwhile, in real-world math tasks, MarCos achieves more than $10\times$ speedup in wall-clock time with the same level of accuracy. Our code is available at https://github.com/Ljyustc/MarCos.

2509.22075 2026-05-05 cs.CL cs.AI

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis

详情
英文摘要

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse--dense computation and integrates with post-training quantization of the sparse coefficients. Code is accessible at https://github.com/mts-ai/CoSpaDi

2509.17677 2026-05-05 cs.AI

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

Comments Accepted at ACL 2026 Findings

详情
英文摘要

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic computation. Existing benchmarks largely focus on well-defined or abstract reasoning and therefore fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show clear performance stratification across difficulty levels: model accuracy declines with task complexity, degrades under minor perturbations, and remains substantially below human performance on high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/AI4Engi/EngiBench.

2509.12464 2026-05-05 cs.AI

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder

详情
英文摘要

Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC

2509.09723 2026-05-05 cs.CL cs.AI cs.LG stat.ME

ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

Kai R. Larsen, Sen Yan, Roland M. Mueller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson

Comments Error in algorithm explanation

详情
英文摘要

Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.

2508.19227 2026-05-05 cs.CL cs.AI cs.HC

Generative Interfaces for Language Models

Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang

Comments ACL 2026 Findings

详情
英文摘要

Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with up to a 72% improvement in human preference. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction. Data and code are available at https://github.com/SALT-NLP/GenUI.

2508.15658 2026-05-05 cs.CL cs.AI cs.IR

SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Xuanyi Chen, Jiaxin Mao, Ziyi Ye, Yiqun Liu

详情
英文摘要

The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

2508.12778 2026-05-05 cs.CL

HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks

Zhe Chen, Yusheng Liao, Zhiyuan Zhu, Haolin Li, Hongcheng Liu, Yanfeng Wang, Yu Wang

Comments ACL 2026 Findings

详情
英文摘要

Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While RAG has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports undermines the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the research gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for tailoring queries to diverse corpora. Incorporating knowledge from such multifaceted sources, Heterogeneous Knowledge Preference Tuning is performed to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 11 datasets and 3 modalities demonstrate that HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.

2508.11205 2026-05-05 cs.LG

Meta-learning Structure-Preserving Dynamics

Cheng Jing, Uvini Balasuriya Mudiyanselage, Woojin Cho, Minju Jo, Anthony Gruber, Kookjin Lee

Comments Accepted to ICML 2026; full camera-ready version will be updated later

详情
英文摘要

Structure-preserving approaches to dynamics discovery have demonstrated great potential for modeling physical systems due to their use of strong inductive biases, which enforce key features such as conservation laws and dissipative behavior. However, these models are typically trained on a per-configuration basis, requiring explicit knowledge of system parameters and costly retraining when these parameters vary. While meta-learning provides a potential remedy, optimization-based approaches can suffer from limited generalizability. Motivated by recent advances in modulation-based learning aimed at mitigating these drawbacks, we systematically investigate the use of modulation techniques in learning conservative dynamical systems. We study a range of existing modulation strategies alongside newly proposed variants, integrating them into a Hamiltonian learning framework without requiring an explicit system parameterization. Through extensive experiments on benchmark problems, we demonstrate that modulation-based meta-learning enables accurate few-shot adaptation, achieving robust generalization across parameter space without compromising the conservation of key invariants responsible for the dynamics.

2508.02046 2026-05-05 cs.RO cs.LG

NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks

Zhihao Luo, Wentao Yan, Jingyu Gong, Min Wang, Zhizhong Zhang, Xuhong Wang, Yuan Xie, Xin Tan

Comments ACL 2026 Main Camera Ready

详情
英文摘要

Recent advances in Graphical User Interface (GUI) and embodied navigation have driven progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDP), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of unifying GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation. (ii) employs a unified reinforcement learning framework on the mix data to improve generalization. (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data mixing strategy, and reward design. Our codes, data, and checkpoints are available at https://iron-boyy.github.io/navimaster-page/.

2507.23386 2026-05-05 cs.CL cs.AI

Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token

Ailiang Lin, Zhuoyun Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura

详情
英文摘要

Decoder-only large language models (LLMs) have been increasingly adopted to build embedding models for diverse tasks. To overcome the inherent limitations of causal attention in representation learning, many existing methods modify the attention mechanism to be bidirectional, potentially undermining LLMs' ability to extract semantic information acquired during pre-training. Meanwhile, leading unidirectional approaches often rely on extra input text to generate contextualized embeddings, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves a new state-of-the-art performance on the MTEB benchmark among models trained solely on publicly available retrieval datasets.

2507.21153 2026-05-05 cs.LG cs.AI cs.SY eess.SY

Green Energy Management for Sustainable Data Centers Using Deep Reinforcement Learning

Abderaouf Bahi, Amel Ourici, Hasan Dincer, Serhat Yuksel, Akila Djebbar

详情
英文摘要

The exponential growth of digital services has positioned data centers among the most energy-intensive infrastructures in the modern economy, raising critical concerns regarding operational costs, carbon emissions, and the sustainable integration of renewable energy sources. This paper proposes a novel Deep Reinforcement Learning (DRL)-based energy management framework for data centers, designed to dynamically coordinate solar photovoltaic generation, wind power, battery storage systems, and conventional grid electricity under highly stochastic operational conditions. The proposed framework formulates the energy management problem as a Markov Decision Process and employs a Proximal Policy Optimization (PPO) agent augmented with a hybrid Long Short-Term Memory and temporal attention architecture, enabling accurate modeling of workload dynamics and renewable generation variability. A multi-objective reward function jointly minimizes energy costs, carbon emissions, and service-level agreement (SLA) violations while promoting efficient storage utilization. Extensive experiments conducted on three datasets demonstrate that the proposed framework achieves a 38\% reduction in energy costs compared to rule-based heuristics and outperforms the strongest DRL baseline by 4.6\%, while maintaining an SLA violation rate as low as 1.5\% and an energy efficiency of 83.7\%. Ablation studies confirm the individual contribution of each architectural component, and hyperparameter sensitivity analysis validates the robustness of the approach across a range of configurations.

2507.18713 2026-05-05 cs.CV cs.RO

SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time

Yun Chen, Matthew Haines, Jingkang Wang, Sahil Jain, Krzysztof Baron-Lis, Sivabalan Manivasagam, Ze Yang, Raquel Urtasun

Comments ICRA 2026. Project page: https://waabi.ai/salf/

详情
英文摘要

High-fidelity sensor simulation of light-based sensors such as cameras and LiDARs is critical for safe and accurate autonomy testing. Neural radiance field (NeRF)-based methods that reconstruct sensor observations via ray-casting of implicit representations have demonstrated accurate simulation of driving scenes, but are slow to train and render, hampering scalability. 3D Gaussian Splatting (3DGS) has demonstrated faster training and rendering times through rasterization, but is primarily restricted to pinhole camera sensors, preventing usage for realistic multi-sensor autonomy evaluation. Moreover, both NeRF and 3DGS couple the representation with the rendering procedure (implicit networks for ray-based evaluation, particles for rasterization), preventing interoperability, which is key for general usage. In this work, we present Sparse Local Fields (SaLF), a novel volumetric representation that supports rasterization and raytracing for unified multi-sensor simulation. SaLF represents volumes as a sparse set of 3D voxel primitives, where each voxel is a local implicit field. SaLF has fast training ($<$30 min) and rendering capabilities (50+ FPS for camera and 600+ FPS for LiDAR), has adaptive pruning and densification to easily handle large scenes, and can support non-pinhole cameras and spinning LiDARs. We demonstrate that SaLF has similar realism as existing self-driving sensor simulation methods while improving efficiency and enhancing capabilities, enabling more scalable simulation.

2507.17777 2026-05-05 cs.AI

ASP-Assisted Symbolic Regression: Uncovering Hidden Physics in Fluid Mechanics

Theofanis Aravanis, Grigorios Chrimatopoulos, Mohammad Ferdows, Michalis Xenos, Efstratios Em Tzirtzilakis

Comments This research was implemented in the framework of the Action "Flagship actions in interdisciplinary scientific fields with a special focus on the productive fabric'', which is implemented through the National Recovery and Resilience Fund Greece 2.0 and funded by the European Union--NextGenerationEU (Project ID: TAEDR-0535983). Sci Rep (2026)

详情
英文摘要

Symbolic Regression (SR) offers an interpretable alternative to conventional Machine-Learning (ML) approaches, which are often criticized as ``black boxes''. In contrast to standard regression models that require a prescribed functional form, SR constructs expressions from a user-defined set of mathematical primitives, enabling the automated discovery of compact formulas that fit the data and reveal underlying physical relationships. In fluid mechanics, where understanding the underlying physics is as crucial as predictive accuracy, this study applies SR to model three-dimensional (3D) laminar flow in a rectangular channel, focusing on the axial velocity and pressure fields. Compact symbolic equations were derived from numerical simulation data, accurately reproducing the expected parabolic velocity profile and linear pressure drop, and showing excellent agreement with analytical solutions from the literature. To address the limitation that purely data-driven SR models may overlook domain-specific constraints, an innovative hybrid framework that integrates SR with Answer Set Programming (ASP) is also introduced. This integration combines the generative power of SR with the declarative reasoning capabilities of ASP, ensuring that derived equations remain both statistically accurate and physically plausible. The proposed SR/ASP methodology demonstrates the potential of combining data-driven and knowledge-representation approaches to enhance interpretability, reliability, and alignment with physical principles in fluid dynamics and related domains.

2507.13563 2026-05-05 cs.CL cs.SD eess.AS

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Grach Mkrtchian

Comments The work is still in progress. Submitted to Interspeech 2026

详情
英文摘要

We introduce Balalaika, an open-source, data-centric pipeline for processing audio and producing prosody-aware annotations. It combines semantic VAD for context-preserving segmentation, multi-ASR ensembling with ROVER consensus decoding, while retaining optional word-level timestamps, followed by automatic quality and speaker-purity filtering. The text is further enriched with punctuation restoration, lexical stress and "\textipa{e}/\textipa{He}" normalization, and IPA phonemes. Using Balalaika, we build a 5.1k-hour multi-source Russian corpus with rich annotations, and show consistent gains under equalized training budgets for both speech denoising and TTS; ablations confirm complementary benefits of stress and punctuation and improved synthesis with stricter MOS filtering. The datasets are publicly available at \href{https://huggingface.co/collections/lab260/balalaika-dataset}{\underline{\textbf{HuggingFace}}}

2507.07129 2026-05-05 cs.LG cs.CL

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

A. Bochkov

Comments Limitations added

详情
英文摘要

We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, we stack new blocks and train only the newest blocks and the LM head; optional LoRA phases provide limited global readjustment under the same active-parameter budget. The paper asks a feasibility/tradeoff question, not whether this regime matches tuned monolithic pretraining. In a common-protocol 9-layer study on a frozen Unicode substrate, the constructive frozen-Unicode model uses 105.0M active trainable parameters, compared with 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline. We then consider an extreme fixed interface: each token is represented only by a frozen 16-dim binary token-ID code, deterministically lifted to d_model, so the resulting token embedding matrix has rank at most 16. Even in this setting, continued growth remains viable. In a 68.9B-token run on FineWeb-Edu + Cosmopedia, a 16-layer 269.7M model trained above this fixed interface reaches 28.92\% MMLU after an interleaved LoRA stage. Reported final metrics are measured after merging the last-stage LoRA adapters into the 269.7M base model. Because the data mixture changes across stages in this long-horizon run, we interpret it as a viability demonstration rather than a clean causal comparison. Overall, the evidence supports a narrow claim: useful continued learning can proceed above a frozen minimal interface under a bounded active trainable-parameter budget, with a clear tradeoff against dense monolithic training in final perplexity.

2506.22899 2026-05-05 cs.CV cs.GR cs.LG cs.MA eess.IV

Neural Cellular Automata: From Cells to Pixels

Ehsan Pajouheshgar, Yitao Xu, Ali Abbasi, Alexander Mordvintsev, Wenzel Jakob, Sabine Süsstrunk

Comments 9 pages, 14 figures, +8 pages of Appendix (20 figures in total)

详情
Journal ref
SIGGRAPH 2026
英文摘要

Neural Cellular Automata (NCAs) are bio-inspired dynamical systems in which identical cells iteratively apply a learned local update rule to self-organize into complex patterns, exhibiting regeneration, robustness, and spontaneous dynamics. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution outputs. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information that impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing an NCA that evolves on a coarse grid with a lightweight implicit decoder that maps cell states and local coordinates to appearance attributes, enabling the same model to render outputs at arbitrary resolution. Moreover, because both the decoder and NCA updates are local, inference remains highly parallelizable. To supervise high-resolution outputs efficiently, we introduce task-specific losses for morphogenesis (growth from a seed) and texture synthesis with minimal additional memory and computation overhead. Our experiments across 2D/3D grids and mesh domains demonstrate that our hybrid models produce high-resolution outputs in real-time, and preserve the characteristic self-organizing behavior of NCAs.

2506.10920 2026-05-05 cs.CL cs.LG

Constructing Interpretable Features from Compositional Neuron Groups

Or Shafran, Atticus Geiger, Mor Geva

Comments Accepted at ACL 2026 main conference

详情
英文摘要

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.