arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 3191
2604.05663 2026-04-14 cs.AI

CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control

Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Shengzhe Xu, Lin Zhang, Lei Li

Comments: Accepted at IJCNN 2026

Traffic signal control (TSC) is a core component of intelligent transportation systems (ITS), aiming to reduce congestion, emissions, and travel time. Recent approaches based on reinforcement learning (RL) and large language models (LLMs) have improved adaptivity, but still suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections. This paper proposes CuraLight, an LLM-centered framework where an RL agent assists the fine-tuning of an LLM-based traffic signal controller. The RL agent explores traffic environments and generates high-quality interaction trajectories, which are converted into prompt-response pairs for imitation fine-tuning. A multi-LLM ensemble deliberation system further evaluates candidate signal timing actions through structured debate, providing preference-aware supervision signals for training. Experiments conducted in SUMO across heterogeneous real-world networks from Jinan, Hangzhou, and Yizhuang demonstrate that CuraLight consistently outperforms state-of-the-art baselines, reducing average travel time by 5.34 percent, average queue length by 5.14 percent, and average waiting time by 7.02 percent. The results highlight the effectiveness of combining RL-assisted exploration with deliberation-based data curation for scalable and interpretable traffic signal control.

2604.05165 2026-04-14 cs.AI eess.SP

Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

Hieu Le, Mostafa Ibrahim, Oguz Bedir, Jian Tao, Sabit Ekin

Reconfigurable Intelligent Surfaces (RIS) have the potential to engineer smart radio environments for next-generation millimeter-wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large-scale deployments. To overcome these bottlenecks, we introduce a "CSI-free" paradigm powered by a Hierarchical Multi-Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot-based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro-scale wave propagation management. The control problem is decomposed into a two-tier neural architecture: a high-level controller executes temporally extended, discrete user-to-reflector allocations, while low-level controllers autonomously optimize continuous focal points using Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray-tracing evaluations demonstrate that this hierarchical framework achieves RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi-user scalability and maintains resilient beam-focusing performance under practical sub-meter localization tracking errors. By eliminating CSI overhead while maintaining high-fidelity signal redirection, this work establishes a scalable and cost-effective blueprint for intelligent wireless environments.

2604.03765 2026-04-14 cs.CV

ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yang, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet

Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.

2604.02927 2026-04-14 cs.LG cs.NI

Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms

Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann

Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.
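
The abstract above mentions predicting link weights in log space. One reason this can be convenient is that exponentiating any real-valued prediction gives a strictly positive weight, which a standard shortest-path routine can consume directly. The sketch below illustrates only that idea; the function name `shortest_path`, the dictionary input format, and the use of Dijkstra's algorithm are assumptions made for illustration and are not taken from the paper.

```python
import heapq
import math


def shortest_path(log_weights, source, target):
    """Dijkstra over edges whose weights are given in log space.

    log_weights: dict mapping a directed edge (u, v) to a real number;
    the edge weight used for routing is exp(log_weight) > 0.
    Returns the node list of a cheapest path, or None if unreachable.
    (Hypothetical helper for illustration, not the paper's method.)
    """
    # Build an adjacency list with strictly positive weights.
    adjacency = {}
    for (u, v), lw in log_weights.items():
        adjacency.setdefault(u, []).append((v, math.exp(lw)))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None
    # Walk predecessors back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

For example, with log-weights of 0.0 on edges a-b and b-c and 1.0 on the direct edge a-c, the two-hop route costs 2 while the direct edge costs e, so the two-hop route is selected.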

2603.28287 2026-04-14 cs.CV

TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K

Mattia D'Urso, Yuxi Hu, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer

Comments: Accepted at 3DMV (CVPR Workshop 2026)

Despite the growing data needs of increasingly sophisticated 3D reconstruction pipelines, suitable public datasets remain scarce. Existing 3D datasets are either low-resolution, limited to a small number of scenes, based on images of varying quality retrieved from the internet, or limited to specific capture scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D aims to answer the need for a challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.

2603.27494 2026-04-14 cs.CV cs.AI

Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

Xuanpu Zhao, Zhentao Tan, Dianmo Sheng, Tianxiang Chen, Yao Liu, Yue Wu, Tao Gong, Qi Chu, Nenghai Yu

Comments: Accepted by CVPR 2026

To enhance the perception and reasoning capabilities of multimodal large language models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously use an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation: the model relies strongly on the global input and depends only weakly on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the "Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning about fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.

2603.26499 2026-04-14 cs.AI

AIRA_2: Overcoming Bottlenecks in AI Research Agents

Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes overfitting and performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$^{\dagger}_{2}$ achieves a mean Percentile Rank of 81.5% at 24 hours and 83.1% at 72 hours, outperforming the strongest baseline, which achieves 72.7%. On AIRS-Bench, AIRA$_2$ exceeds human state-of-the-art on 6 out of 20 diverse research tasks. Ablations confirm that each architectural component is necessary, that performance follows a predictable scaling law that transfers across LLM backbones, and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.

2603.25975 2026-04-14 cs.LG cs.AI cs.CL

Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

Peter Balogh

We show that they do. Roger Schank's conceptual dependency theory proposed that all human events decompose into primitive operations -- ATRANS (transfer of possession), PTRANS (physical movement), MTRANS (information transfer), and others -- hand-coded from linguistic intuition. We ask: can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world-state pairs, the system searches for operator compositions explaining each event (wake), then extracts recurring patterns as library entries under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping to Schank's core: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators (e.g., "mail" = ATRANS composed with PTRANS) and novel emotional-state operators absent from Schank's taxonomy. We validate on synthetic events, ATOMIC (Sap et al., 2019), and GLUCOSE (Mostafazadeh et al., 2020). On synthetic data, the discovered library achieves MDL within 4% of Schank's hand-coded primitives at 100% coverage (vs. Schank's 81%). On ATOMIC, Schank covers only 10%; on GLUCOSE, 31%. The discovered library covers 100% of both, dominated by mental/emotional operators -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. Libraries discovered from one corpus transfer to the other with under 1 bit/event degradation despite different annotation schemes and domains, suggesting the operators are information-theoretically determined structure, not dataset artifacts.

2603.23964 2026-04-14 cs.AI

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

Lijing Luo, Yiben Luo, Alexey Gorbatovski, Sergey Kovalchuk, Xiaodan Liang

Comments: 32 pages main text, 18 figures

The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a "Semantic Prior" ecosystem dominated by Large Language Models (LLMs) and a "Domain-Specific Generalization" ecosystem. Furthermore, we characterize the "cognitive fingerprints" of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

2603.22962 2026-04-14 cs.LG stat.ML

Asymptotic Learning Curves for Diffusion Models with Random Features Score and Manifold Data

Anand Jerry George, Nicolas Macris

Comments: The proof of Lemma 1 in Appendix C is incorrect

We study the theoretical behavior of denoising score matching -- the learning task associated with diffusion models -- when the data distribution is supported on a low-dimensional manifold and the score is parameterized using a random feature neural network. We derive asymptotically exact expressions for the test, train, and score errors in the high-dimensional limit. Our analysis reveals that, for linear manifolds, the sample complexity required to learn the score function scales linearly with the intrinsic dimension of the manifold, rather than with the ambient dimension. Perhaps surprisingly, the benefits of low-dimensional structure start to diminish once we have a non-linear manifold. These results indicate that diffusion models can benefit from structured data; however, the dependence on the specific type of structure is subtle and intricate.

2603.22241 2026-04-14 cs.CL

MemDLM: Memory-Enhanced DLM Training

Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked prediction objective that never exposes the model to the progressive denoising dynamics of inference, and forces all contextual information to be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from token-space attention to parameter space, MemDLM yields faster convergence, stronger long-context representations, and lower training loss, even when the fast weights are discarded at inference time. Re-enabling the inner loop at inference provides an additional prompt-specific adaptation effect, where the Parametric Memory acts as an emergent in-weight retrieval mechanism on challenging Needle-in-a-Haystack tasks. Code: https://github.com/JarvisPei/MemDLM.

2603.21831 2026-04-14 cs.RO math.DG

Directional Mollification for Knot-Preserving $C^{\infty}$ Smoothing of Polygonal Chains with Explicit Curvature Bounds

Alfredo González-Calvin, Juan F. Jiménez, Héctor García de Marina

Starting from a polygonal chain (a first-order polynomial spline) through prescribed knots (vertices), we introduce the directional mollification operator, which acts on polygonal chains and locally integrable functions, and produces $C^{\infty}$ curve approximants arbitrarily close -- pointwise and uniformly on compact subsets -- to the original curve, while still intersecting the original vertices. Unlike standard mollification, which confines the smoothed curve to the convex hull of the image of the original curve and does not preserve the vertices, the directional construction permits local and vertex-preserving smoothing. That is, modifying a single line segment from the polygonal chain alters the $C^{\infty}$ output only on that segment and within an explicitly controllable small neighborhood of its endpoints. The operator admits closed-form curvature bounds and yields infinitely differentiable curves with analytic control over curvature. We further develop a parametric family of smoothing operators that contains both the conventional mollification and the proposed directional variant as special cases, providing a unified geometric framework for converting non-differentiable polygonal data into smooth curves with exact point interpolation, computational simplicity, explicit curvature control, and strong local support properties. These features make the method directly useful for geometric modeling, curve design, and applications that require both smoothness and strict knot/waypoint fidelity, such as in robotics, computer graphics and CNC machining.

2603.18806 2026-04-14 cs.AI

dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Wenxuan Zhang, Lemeng Wu, Changsheng Zhao, Ernie Chang, Mingchen Zhuge, Zechun Liu, Andy Su, Hanxian Huang, Jun Chen, Chong Zhou, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Wei Wen

Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.

2603.12639 2026-04-14 cs.CV

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, Xiu Li

Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.

2603.12221 2026-04-14 cs.CV

A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Jiajun Sun, Zhe Gao

Comments: Camera-ready version. 14 pages, 5 figures in total: 8 pages main text with 4 figures, 3 pages references, and 3 pages appendix with 1 figure. Accepted at the 10th ABAW Workshop, CVPR 2026

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.

2603.11974 2026-04-14 cs.AI

Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-Agent AI

Luca Deck, Simeon Allmendinger, Lucas Müller, Niklas Kühl

Comments: ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT '26)

In the late 2010s, the fashion trend NormCore framed sameness as a signal of belonging, illustrating how norms emerge through collective coordination. Today, similar forms of normative coordination can be observed in systems based on Multi-agent Artificial Intelligence (MAAI), as AI-based agents deliberate, negotiate, and converge on shared decisions in fairness-sensitive domains. Yet, existing empirical approaches often treat norms as targets for alignment or replication, implicitly assuming equivalence between human subjects and AI agents and leaving collective normative dynamics insufficiently examined. To address this gap, we propose Normative Common Ground Replication (NormCoRe), a novel methodological framework to systematically translate the design of human subject experiments into MAAI environments. Building on behavioral science, replication research, and state-of-the-art MAAI architectures, NormCoRe maps the structural layers of human subject studies onto the design of AI agent studies, enabling systematic documentation of study design and analysis of norms in MAAI. We demonstrate the utility of NormCoRe by replicating a seminal experimental study on distributive justice, in which participants negotiate fairness principles under a "veil of ignorance". We show that normative judgments in AI agent studies can differ from human baselines and are sensitive to the choice of the foundation model and the language used to instantiate agent personas. Our work provides a principled pathway for analyzing norms in MAAI and helps to guide, reflect, and document design choices whenever AI agents are used to automate or support tasks formerly carried out by humans.

2603.10079 2026-04-14 cs.LG math.PR

Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

Benjamin Gess, Daniel Heydecker

Large loss spikes in stochastic gradient descent are studied through a rigorous large-deviations analysis for a shallow, fully connected network in the NTK scaling. In contrast to full-batch gradient descent, the catapult phase is shown to split into inflationary and deflationary regimes, determined by an explicit log-drift criterion. In both cases, large spikes are shown to be at least polynomially likely. In addition, these spikes are shown to be the dominant mechanism by which sharp minima are escaped and curvature is reduced, thereby favouring flatter solutions. Corresponding results are also obtained for certain ReLU networks, and implications for curriculum learning are derived.

2602.14812 2026-04-14 cs.CL

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.

2602.13135 2026-04-14 cs.AI cs.LO

Constrained Assumption-Based Argumentation Frameworks

Emanuele De Angelis, Fabio Fioravanti, Maria Chiara Meo, Alberto Pettorossi, Maurizio Proietti, Francesca Toni

Comments: Extended version with proofs and additional results of the full paper accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026). DOI: https://doi.org/10.65109/KRAP9309

Assumption-based Argumentation (ABA) is a well-established form of structured argumentation. ABA frameworks with an underlying atomic language are widely studied, but their applicability is limited by a representational restriction to ground (variable-free) arguments and attacks built from propositional atoms. In this paper, we lift this restriction and propose a novel notion of constrained ABA (CABA), whose components, as well as arguments built from them, may include constrained variables, ranging over possibly infinite domains. We define non-ground semantics for CABA, in terms of various notions of non-ground attacks. We show that the new semantics conservatively generalise standard ABA semantics.

2602.12748 2026-04-14 cs.AI cs.HC cs.SE

X-SYS: A Reference Architecture for Interactive Explanation Systems

Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin

Comments: 18 pages, 8 figures

The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.

2602.10751 2026-04-14 cs.LG

Predicting integers from continuous parameters

Bas Maat, Peter Bloem


We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. Examples include the number of up-votes on a social media post or the number of bicycles available at a public rental station. While it is possible to model these as continuous values and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question of whether such integer labels can be modeled directly by a discrete distribution whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
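
The Bitwise idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the least-significant-bit-first ordering, and the use of one logit per bit are all assumptions made here for clarity.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def int_to_bits(k, n_bits):
    """Return the n_bits lowest bits of k, least-significant bit first."""
    return [(k >> i) & 1 for i in range(n_bits)]

def bitwise_log_prob(k, logits):
    """Log-probability of integer k under independent Bernoulli bits.

    logits[i] is a hypothetical network output for bit i (LSB first);
    sigmoid(logits[i]) is the probability that bit i equals 1.
    """
    bits = int_to_bits(k, len(logits))
    total = 0.0
    for bit, z in zip(bits, logits):
        # log p(bit=1) = log sigmoid(z); log p(bit=0) = log sigmoid(-z)
        total += log_sigmoid(z) if bit == 1 else log_sigmoid(-z)
    return total

# Three bits cover the integers 0..7; the probabilities sum to 1
# up to floating-point error.
logits = [2.0, -1.0, 0.5]
total_mass = sum(math.exp(bitwise_log_prob(k, logits)) for k in range(8))
```

Because the logits are continuous, the negative of `bitwise_log_prob` could serve as a differentiable training loss in an autodiff framework, which is the property the abstract asks of such distributions.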

2602.10420 2026-04-14 cs.LG cs.IT eess.IV eess.SP math.IT

Binary Flow Matching: Prediction-Loss Space Alignment for Robust Learning

Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang

Comments 15 pages, 3 tables, 11 figures


Flow matching has emerged as a powerful framework for generative modeling, with recent empirical successes highlighting the effectiveness of signal-space prediction ($x$-prediction). In this work, we investigate the transfer of this paradigm to binary manifolds, a fundamental setting for generative modeling of discrete data. While $x$-prediction remains effective, we identify a latent structural mismatch that arises when it is coupled with velocity-based objectives ($v$-loss), leading to a time-dependent singular weighting that amplifies gradient sensitivity to approximation errors. Motivated by this observation, we formalize prediction-loss alignment as a necessary condition for flow matching training. We prove that re-aligning the objective to the signal space ($x$-loss) eliminates the singular weighting, yielding uniformly bounded gradients and enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Finally, with alignment secured, we examine design choices specific to binary data, revealing a topology-dependent distinction between probabilistic objectives (e.g., cross-entropy) and geometric losses (e.g., mean squared error). Together, these results provide theoretical foundations and practical guidelines for robust flow matching on binary -- and related discrete -- domains, positioning signal-space alignment as a key principle for robust diffusion learning.
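
The singular weighting mentioned in the abstract can be illustrated with a short derivation. This is a sketch under an assumption that the abstract does not state: a linear interpolation path $x_t = (1-t)\,x_0 + t\,x_1$, with $x_0$ the noise sample, $x_1$ the data sample, and $\hat{x}_\theta$ the network's signal-space prediction. The paper's exact parameterization may differ.

```latex
% Assumption: linear path x_t = (1-t) x_0 + t x_1 (x_0 noise, x_1 data).
\begin{align*}
v &= x_1 - x_0 = \frac{x_1 - x_t}{1 - t}, \\
\hat{v}_\theta &= \frac{\hat{x}_\theta - x_t}{1 - t}, \\
\lVert \hat{v}_\theta - v \rVert^2
  &= \frac{1}{(1 - t)^2}\,\lVert \hat{x}_\theta - x_1 \rVert^2 .
\end{align*}
```

Under this assumption, the $v$-loss equals the $x$-loss scaled by $1/(1-t)^2$, which diverges as $t \to 1$. Training directly on $\lVert \hat{x}_\theta - x_1 \rVert^2$ removes that factor.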

2602.03402 2026-04-14 cs.AI cs.LG

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li


Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

2601.14706 2026-04-14 cs.CV

LookBench: A Live and Holistic Open Benchmark for Fashion Image Retrieval

Gensmo.ai, Chao Gao, Siqiao Xue, Jiwen Fu, Tingyi Gu, Shanshan Li, Fan Zhou

Comments The first two authors contributed equally to this work. Project site: https://serendipityoneinc.github.io/look-bench-page/


In this paper, we present LookBench, a live, holistic and challenging benchmark for fashion image retrieval in real e-commerce settings. We use the term "look" to reflect retrieval that mirrors how people shop: finding the exact item, a close substitute, or a visually consistent alternative. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped, and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval across. Our experiments reveal that LookBench poses a significant challenge for strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.

2601.14477 2026-04-14 cs.CV cs.AI eess.IV

XD-MAP: Cross-Modal Domain Adaptation via Semantic Parametric Maps for Scalable Training Data Generation

Frank Bieder, Hendrik Königshof, Haohao Hu, Fabian Immel, Yinzhe Shen, Jan-Hendrik Pauls, Christoph Stiller

Comments 10 pages, 7 figures, 3 tables, accepted at CVPRW


Until open-world foundation models match the performance of specialized approaches, deep learning systems remain dependent on task- and sensor-specific data availability. To bridge the gap between available datasets and deployment domains, domain adaptation strategies are widely used. In this work, we propose XD-MAP, a novel approach to transfer sensor-specific knowledge from an image dataset to LiDAR, an entirely different sensing domain. Our method leverages detections on camera images to create a semantic parametric map. The map elements are modeled to produce pseudo labels in the target domain without any manual annotation effort. Unlike previous domain transfer approaches, our method does not require direct overlap between sensors and enables extending the angular perception range from a front-view camera to a full 360° view. On our large-scale road feature dataset, XD-MAP outperforms single-shot baseline approaches by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation. The results demonstrate the effectiveness of our approach, which achieves strong performance on LiDAR data without any manual labeling.

2601.14346 2026-04-14 cs.LG cs.AI

DiSPA: Differential Substructure-Pathway Attention for Drug Response Prediction

Yewon Han, Sunghyun Kim, Eunyi Jeong, Sungkyung Lee, Seokwoo Yun, Sangsoo Lim


Accurate prediction of drug response in precision medicine requires models that capture how specific chemical substructures interact with cellular pathway states. However, most existing deep learning approaches treat chemical and transcriptomic modalities independently or combine them only at late stages, limiting their ability to model fine-grained, context-dependent mechanisms of drug action. In addition, vanilla attention mechanisms are often sensitive to noise and sparsity in high-dimensional biological networks, hindering both generalization and interpretability. We present DiSPA (Differential Substructure-Pathway Attention), a framework that models bidirectional interactions between chemical substructures and pathway-level gene expression. DiSPA introduces differential cross-attention to suppress spurious associations while enhancing context-relevant interactions. On the GDSC benchmark, DiSPA achieves state-of-the-art performance, with strong improvements in the disjoint setting. These gains are consistent across random and drug-blind splits, suggesting improved robustness. Analyses of attention patterns indicate more selective and concentrated interactions compared to standard cross-attention. Exploratory evaluation shows that differential attention better prioritizes predefined target-related pathways, although this does not constitute mechanistic validation. DiSPA also shows promising generalization on external datasets (CTRP) and cross-dataset settings, although further validation is needed. It further enables zero-shot application to spatial transcriptomics, providing exploratory insights into region-specific drug sensitivity patterns without ground-truth validation.

2601.13844 2026-04-14 cs.LG

Optimal L2 Regularization in High-dimensional Continual Linear Regression

Gilad Karpel, Edward Moroshko, Ran Levinstein, Ron Meir, Daniel Soudry, Itay Evron

Comments Accepted to ALT 2026


We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.

2601.12104 2026-04-14 cs.CL cs.AI cs.CR

Powerful Training-Free Membership Inference Against Autoregressive Language Models

David Ilić, David Stanojević, Kostadin Cvejoski

Comments 9 pages, 2 figures; appendix with additional experiments and derivations


Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains-Research/ez-mia.

2601.12038 2026-04-14 cs.AI

Subargument Argumentation Frameworks: Separating Direct Conflict from Structural Dependency

Beishui Liao

Comments The original title, "Abstract Argumentation with Subargument Relations," has been replaced by "Subargument Argumentation Frameworks: Separating Direct Conflict from Structural Dependency"


Dung's abstract argumentation frameworks model acceptability solely in terms of an attack relation, thereby conflating two conceptually distinct aspects of argumentative reasoning: direct conflict between arguments and the structural dependencies that arise from their internal composition. While this abstraction preserves extension-based semantics, it obscures how justification is grounded in subarguments and how defeats propagate through argument structure. We introduce Subargument Argumentation Frameworks (SAFs), an abstract framework in which direct attack and subargumenthood are represented as independent primitive relations. This separation makes structural dependency explicit at the representational level while leaving its semantic impact to be determined by structure-sensitive notions of defence, admissibility, and complete semantics defined within the framework. We show that projecting SAFs onto attack-only frameworks yields extension-equivalent Dung frameworks under all standard semantics, yet the projection irreversibly loses information about justificatory grounding and structural propagation. SAFs therefore provide strictly greater representational expressiveness while remaining semantically compatible with Dung's theory, thereby offering a principled basis for structure-sensitive accounts of defence, justification, and explanation in abstract argumentation.

2601.09270 2026-04-14 cs.CL

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bing Qin

Comments Accepted in ACL 2026 (Findings)


With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has gained significant attention in Chinese Classical Studies (CCS). While existing research primarily focuses on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we introduce the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA), a 119-hour corpus comprising 22,000 audio samples. It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current MLLMs still face substantial challenges on the MCGA test set. Furthermore, we introduce a domain-specific metric for SEC and a metric to measure the consistency between speech and text capabilities. We release MCGA to the public to facilitate the development of more robust MLLMs. MCGA Corpus: https://github.com/yxduir/MCGA