arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2086
2603.16309 2026-05-08 cs.CL

Omnilingual MT: Machine Translation for 1,600 Languages

Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Xiang "Tony" Cao, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà

详情
英文摘要

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundreds more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

2603.16281 2026-05-08 cs.LG q-bio.NC

Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction

Saarang Panchavati, Uddhav Panchavati, Hiroki Nariai, Corey Arnold, William Speier

详情
英文摘要

Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear; reported improvements over smaller task-specific models are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which learn by predicting latent representations instead of reconstructing raw signals. We introduce Laya, the first EEG foundation model based on LeJEPA. We show that latent prediction yields representations that encode semantic structure in EEG: Laya embeddings track clinically meaningful state changes such as seizure onset, are resilient to noise, and achieve the strongest mean clinical accuracy under frozen linear probing, with particular gains on tasks where relevant neural patterns are subtle and easily obscured by artifacts. Controlled ablations against matched MAE variants confirm that the choice of pretraining objective, rather than architecture or data, is the primary driver of these gains.

2603.15646 2026-05-08 cs.LG cs.AI cs.CL

Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

Guangchen Lan, Lian Xiong, Xin Zhou, Hejie Cui, Yuwei Zhang, Mao Li, Zhenyu Shi, Besnik Fetahu, Lihong Li, Xian Li

详情
英文摘要

Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).

2603.14337 2026-05-08 cs.CV

On the Nature of Attention Sink that Shapes Decoding Strategy in Omni-LLMs

Suho Yoo, Youngjoon Jang, Joon Son Chung

Comments Preprint

详情
英文摘要

The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed across them is central to their behaviour. We focus specifically on attention sinks, tokens that absorb a disproportionate share of attention mass regardless of their semantic content, to understand how this routing unfolds. To this end, we conduct a systematic analysis of sink behaviour in Omni-LLMs. Our analysis yields two key findings: (i) high sink attention does not solely indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the representation as a whole. Building on this, we propose OutRo, which correspondingly aligns non-sink token representations with the sink in feature space, and relaxes the causal mask for sink tokens at an early layer to sharpen this bias before the rest of decoding proceeds. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.

2603.14209 2026-05-08 cs.CV cs.AI

ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan

Comments Project page: https://chartist-ai.github.io/

详情
英文摘要

A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.

2603.09986 2026-05-08 cs.CL cs.AI

Quantifying Hallucinations in Language Language Models on Medical Textbooks

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Comments 8 pages, 4 figures

详情
英文摘要

Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments, the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7\% of answers (95\% CI 18.6 to 20.7) even though 98.8\% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and 2 respectively. Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.

2603.03511 2026-05-08 cs.LG cond-mat.mtrl-sci physics.chem-ph

Orbital Transformers for Predicting Wavefunctions in Time-Dependent Density Functional Theory

Xuan Zhang, Haiyang Yu, Chengdong Wang, Jacob Helwig, Shuiwang Ji, Xiaofeng Qian

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
英文摘要

We aim to learn wavefunctions simulated by time-dependent density functional theory (TDDFT), which can be efficiently represented as linear combination coefficients of atomic orbitals. In real-time TDDFT, the electronic wavefunctions of a molecule evolve over time in response to an external excitation, enabling first-principles predictions of physical properties such as optical absorption, electron dynamics, and high-order response. However, conventional real-time TDDFT relies on time-consuming propagation of all occupied states with fine time steps. In this work, we propose OrbEvo, which is based on an equivariant graph transformer architecture and learns to evolve the full electronic wavefunction coefficients across time steps. First, to account for external field, we design an equivariant conditioning to encode both strength and direction of external electric field and break the symmetry from SO(3) to SO(2). Furthermore, we design two OrbEvo models, OrbEvo-WF and OrbEvo-DM, using wavefunction pooling and density matrix as interaction method, respectively. Motivated by the central role of the density functional in TDDFT, OrbEvo-DM encodes the density matrix aggregated from all occupied electronic states into feature vectors via tensor contraction, providing a more intuitive approach to learn the time evolution operator. We adopt a training strategy specifically tailored to limit the error accumulation of time-dependent wavefunctions over autoregressive rollout. To evaluate our approach, we generate TDDFT datasets consisting of 5,000 different molecules in the QM9 dataset and 1,500 molecular configurations of the malonaldehyde molecule in the MD17 dataset. Results show that our OrbEvo model accurately captures quantum dynamics of excited states under external field, including time-dependent wavefunctions, time-dependent dipole moment, and optical absorption spectra.

2603.02087 2026-05-08 cs.CV cs.AI cs.LG

A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

Harikrishnan Unnikrishnan, Rita Patel

Comments for associated code see: https://github.com/hari-krishnan/openglottal

详情
英文摘要

We present a fully automated, two-stage modular glottal area segmentation framework for high-speed videoendoscopy (HSV) designed for accuracy, generalizability, and real-time playback. Our detection-gated pipeline combines a YOLOv8n glottis localizer with a U-Net segmenter; the localizer defines a tight crop to ensure a consistent field of view and gates the output to reduce spurious segmentations during glottal closure. The models were trained on the GIRAFE (N=600) and BAGLS (N=55,750) datasets. Cross-dataset portability was evaluated by benchmarking GIRAFE-trained models on the BAGLS test set without fine-tuning. In these evaluations, the pipeline achieved a Dice Similarity Coefficient (DSC) of 0.745 (87% of the in-domain ceiling). On in-distribution test sets, the system achieved DSCs of 0.81 (GIRAFE) and 0.856 (BAGLS), outperforming or competing with state-of-the-art methods. An exploratory clinical study of 40 subjects demonstrated that the glottal area Coefficient of Variation (CV distinguished healthy from pathological function (p=0.006). The system processes ~35 frames per second on commodity hardware, enabling interactive clinical review. This design supports uniform extraction of laryngeal kinematic measures across varying acquisition settings. Code, weights, and software are available at https://github.com/hari-krishnan/openglottal.

2603.00117 2026-05-08 cs.RO cs.AI

PEPA: a Persistently Autonomous Embodied Agent with Personalities

Kaige Liu, Yang Li, Lijun Zhu, Weinan Zhang

详情
英文摘要

Living organisms exhibit persistent autonomy through internally generated goals and self-sustaining behavioral organization, yet current embodied agents remain driven by externally scripted objectives. This dependence on predefined task specifications limits their capacity for long-term deployment in dynamic, unstructured environments where continuous human intervention is impractical. We propose that personality traits provide an intrinsic organizational principle for achieving persistent autonomy. Analogous to genotypic biases shaping biological behavioral tendencies, personalities enable agents to autonomously generate goals and sustain behavioral evolution without external supervision. To realize this, we develop PEPA, a three-layer cognitive architecture that operates through three interacting systems: Sys3 autonomously synthesizes personality-aligned goals and refines them via episodic memory and daily self-reflection; Sys2 performs deliberative reasoning to translate goals into executable action plans; Sys1 grounds the agent in sensorimotor interaction, executing actions and recording experiences. We validate the framework through real-world deployment on a quadruped robot in a multi-floor office building. Operating without reliance on fixed task specifications, the robot autonomously arbitrates between user requests and personality-driven motivations, navigating elevators and exploring environments accordingly. Quantitative analysis across five distinct personality prototypes demonstrates stable, trait-aligned behaviors. The results confirm that personality-driven cognitive architectures enable sustained autonomous operation characteristic of persistent embodied systems. Code and demo videos are available at https://sites.google.com/view/pepa-persistent/.

2602.22710 2026-05-08 cs.SD cs.AI cs.HC

Same Words, Different Judgments: How Preferences Vary Across Modalities

Aaron Broukhim, Nadir Weibel, Eshin Jolly

Comments Submitted to NeurIPS 2026 for review

详情
英文摘要

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving $\textit{good}$ agreement within either modality (ICC(2,$k$) $\approx$ .80) requires $\sim$9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving as an early signal for stimulus selection and proxy for human annotations. Together, these findings argue that evaluation protocols for audio preference data require modality-specific design rather than direct adaptation from text.

2602.19202 2026-05-08 cs.CV

UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

Gang Xu, Zhiyu Zhu, Junhui Hou

详情
英文摘要

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.

2602.15827 2026-05-08 cs.RO cs.AI cs.LG cs.SY eess.SY

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu

详情
英文摘要

While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.

2602.13670 2026-05-08 cs.LG

Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang

Comments 20 pages, 11 figures, 9 tables. Accepted by ICML2026

详情
英文摘要

Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy is often compromised by accumulated errors and feature incompatibility. In this paper, we first conduct a systematic study to dissect the failure modes of PTM-based analytic CIL, identifying representation rigidity as the primary bottleneck. Motivated by this insight, we propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Specifically, we coherently fuse plastic, task-adapted features with a frozen, universal visual anchor at the feature level through geometric calibration, and leverage cross-modal semantic priors at the decision level to rectify prediction bias. This confluence maintains analytic-learning's extreme efficiency while overcoming its inherent brittleness. Extensive experiments across eight benchmarks demonstrate that VILA consistently yields superior performance, particularly in fine-grained and long-sequence scenarios. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning. Our code is available at https://github.com/byzhaoAI/VILA.

2602.13310 2026-05-08 cs.CV cs.AI

Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension

Haoran Xu, Hongyu Wang, Jiaze Li, Shunpeng Chen, Zizhao Tong, Jianzhong Ju, Zhenbo Luo, Jian Luan

详情
英文摘要

Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.

2602.11229 2026-05-08 cs.AI cs.LG

Latent Generative Solvers for Generalizable Long-Term Physics Simulation

Zituo Chen, Sili Deng

详情
英文摘要

Reliable physics simulation demands two capabilities that today's neural PDE solvers do not deliver together: generalization across heterogeneous PDE families, and stability under long autoregressive rollouts. Deterministic operators accumulate error geometrically, while existing probabilistic solvers are confined to a single PDE family or short horizons. We close this gap with the \textbf{Latent Generative Solver} (LGS), three coupled components: (i) a Physics VAE (PhyVAE) compressing twelve PDE families into a shared latent manifold; (ii) a Pyramidal Flow-Forcing Transformer (PFlowFT) that generates the next latent by flow matching, conditioned on a per-trajectory context updated on the model's own predictions; and (iii) input noising during training, for which we derive a sufficient-condition contraction bound explaining the observed long-horizon stability. Pretrained on a 2.5\,M-trajectory, 16-system corpus at $128^2$, LGS matches the strongest deterministic baseline at one step, wins on 15/16 systems at both 5- and 10-step rollout, cuts 20-step L2RE from $56.1\%$ to $\mathbf{30.2\%}$, and uses $\mathbf{13}$--$\mathbf{77\times}$ less recurrent dynamics-step compute. It also adapts efficiently to a $256^2$ Kolmogorov flow held out from the pretraining corpus, dropping 1-step L2RE from $0.398$ to $0.129$ in five finetune epochs against U-AFNO's $0.653{\to}0.343$.

2602.11183 2026-05-08 cs.RO cs.CV cs.SY eess.SY

Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering

Yin Tang, Jiawei Ma, Jinrui Zhang, Alex Jinpeng Wang, Deyu Zhang

Comments ICML 2026 Camera Ready

详情
英文摘要

Continuous navigation in complex environments is critical for Unmanned Aerial Vehicle (UAV). However, the existing Vision-Language Navigation (VLN) models follow the dead-reckoning, which iteratively updates its position for the next waypoint prediction, and subsequently construct the complete trajectory. Then, such stepwise manner will inevitably lead to accumulated errors of position over time, resulting in misalignment between internal belief and objective coordinates, which is known as "state drift" and ultimately compromises the full trajectory prediction. Drawing inspiration from classical control theory, we propose to correct for errors by formulating such sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamics and a Likelihood Correction, from historical observation. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on TravelUAV benchmark demonstrate that, with only 10% of the training data fine-tuning, our method clearly outperforms strong baselines and regulates drift accumulation.

2602.07974 2026-05-08 cs.LG

Structural Learning Theory: A Metric-Topology Factorization Approach

Xin Li

详情
英文摘要

Learning in structured, multi-context, or non-stationary environments involves two orthogonal difficulties. The first is \emph{metric}: once the correct context is known, how hard is prediction within it? This is the domain of Statistical Learning Theory (SLT). The second is \emph{structural}: how many local contexts are required, and how can they be discovered from data? This paper develops \emph{Structural Learning Theory} (StrLT) for the structural axis. We introduce \emph{width}, the minimum number of jointly contractive and low-risk cells needed to cover a learning problem. Width is incomparable with VC dimension: either can diverge while the other remains bounded. We show that width induces a \emph{phase transition}: if the allocated number of cells \(K<w\), learning suffers an irreducible structural error floor; if \(K\ge w\), the problem reduces to ordinary within-cell statistical learning. To estimate width, we introduce the \emph{contractive-similarity} (CS) operator, a task-adaptive graph kernel combining geometric locality with predictive compatibility. Its CS Laplacian exposes contractive basins through spectral separation. We further develop the \emph{metric slingshot}, which reuses low-dimensional latent contraction maps to reduce funnel-learning cost. Together, width, CS estimation, and the slingshot decompose learning into trap discovery and funnel generalization, with deep implications for continual and lifelong learning in an open-ended environment.

2602.07906 2026-05-08 cs.LG cs.AI

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Yanfeng Wang, Siheng Chen

Comments 18 pages, 5 figures

详情
英文摘要

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.

2602.07322 2026-05-08 cs.RO cs.AI

Action-to-Action Flow Matching

Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang

Comments 20 pages, 19 figures

详情
英文摘要

Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous proprioceptive action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step, and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.

2602.05896 2026-05-08 cs.LG cs.AI

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga

Comments 15 pages. Version 2 -- lower bound extended from 1-layer 1-head to 1-layer O(1)-head transformers

详情
英文摘要

Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the cost of increasing the number of layers from two to four. Specifically, we show that PARITY can be computed by a four-layer transformer using softmax attention, length-independent and polynomially bounded positional encoding, no layernorm, and compatible with both causal and non-causal masking.

2602.02493 2026-05-08 cs.CV cs.AI

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

Zehong Ma, Ruihan Xu, Shiliang Zhang

Comments Project Pages: https://zehong-ma.github.io/PixelGen/

详情
英文摘要

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Codes are available at https://github.com/Zehong-Ma/PixelGen.

2602.02288 2026-05-08 cs.LG cs.AI

AROpt: An Optimization Method for Autoregressive Time Series Forecasting

Zheng Li, Jerry Cheng, Huanying Gu

Comments 16 pages, 5 figures, 3 tables

详情
英文摘要

Current time-series forecasting models are primarily based on transformer-style neural networks. These models achieve long-term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, traditional time-series forecasting model training ignores the monotonic error-growth heuristic. In this paper, we propose a novel training method for time-series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon. Violations of this trend are interpreted as rollout inconsistency and are softly penalized during training, and (2) the method enables models to be able to concatenate short-term AR predictions to form flexible long-term forecasts. Empirical results demonstrate that our method establishes a new state-of-the-art across multiple benchmarks, achieving an MSE reduction of more than $10\%$ compared to iTransformer and other recent strong baselines. Furthermore, it enables short-horizon forecasting models to perform reliable long-term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt

2602.01505 2026-05-08 cs.LG stat.ML

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Navdeep Kumar, Tehila Dahan, Lior Cohen, Ananyabrata Barua, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

Comments Following further internal verification, we identified foundational issues in the analytical framework, including unresolved problems in the treatment of nonstationary sampling and parts of the coupled convergence analysis under the stated assumptions. Addressing these issues requires a substantial overhaul of the theoretical framework beyond a standard revision

详情
英文摘要

We establish an optimal sample complexity of $O(ε^{-2})$ for obtaining an $ε$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(ε^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.

2602.01150 2026-05-08 cs.LG cs.AI cs.CR cs.CV math.OC

SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing

Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Jie Fu, Chengyang Dong, Heng Xu, Jialong Li, Bo Liu

详情
英文摘要

Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.

2602.01124 2026-05-08 cs.LG

ChronoSpike: An Adaptive Spiking Graph Neural Network for Dynamic Graphs

Md Abrar Jahin, Taufikur Rahman Fuad, Jay Pujara, Craig Knoblock

详情
英文摘要

Dynamic graph representation learning requires capturing both structural relations and temporal evolution, yet existing approaches face a core trade-off: attention-based methods offer expressiveness at $O(T^2)$ complexity, while recurrent architectures suffer from gradient pathologies and dense state storage. Spiking neural networks provide event-driven efficiency but are constrained by sequential propagation, binary information loss, and local aggregation that lacks global context. We propose ChronoSpike, an adaptive spiking graph neural network that integrates learnable LIF neurons with per-channel membrane dynamics, multi-head spatially-attentive aggregation over continuous features, and a lightweight Transformer temporal encoder. This design enables fine-grained local modeling and long-range dependency capture with $O(T \cdot d)$ activation/state memory and an additional $O(T^2)$ per-node attention term that remains small for the horizons evaluated here. ChronoSpike outperforms twelve state-of-the-art baselines on three large benchmarks by $2.0$% Macro-F1 and $2.4$% Micro-F1 on average while achieving $3-10\times$ faster training than recurrent methods with a constant 105K-parameter budget independent of graph size. We provide theoretical guarantees for membrane potential boundedness, gradient flow stability under contraction factor $ρ<1$, and BIBO stability; interpretability analyses reveal heterogeneous temporal receptive fields and a learned primacy effect with $83-88$% sparsity.

2602.00407 2026-05-08 cs.LG

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

Suprim Nakarmi, Junggab Son, Yue Zhao, Zuobin Xiong

Comments 9 pages, 3 figures, and 4 tables

详情
英文摘要

Federated Graph Neural Networks (FedGNNs) facilitate collaborative learning across multiple clients with graph-structured data while preserving user privacy. However, emerging research indicates that within this setting, shared model updates, particularly gradients, can unintentionally leak sensitive information of local users. Numerous privacy inference attacks have been explored in traditional federated learning and extended to graph settings, but the problem of label distribution inference in FedGNNs remains largely underexplored. In this work, we introduce Fed-Listing (Federated Label Distribution Inference in GNNs), a novel gradient-based attack designed to infer the private label statistics of target clients in FedGNNs without access to raw data or node features. Fed-Listing only leverages the final-layer gradients exchanged during training to uncover statistical patterns that reveal class proportions in a stealthy manner. Extensive experiments on four benchmark datasets and three GNN architectures show that Fed-Listing significantly outperforms existing baselines, including random guessing and Decaf, even under challenging non-i.i.d. scenarios. Moreover, existing defense mechanisms can barely reduce the attack performance of Fed-Listing, unless the model's utility is severely degraded. The code implementation and Supplementary materials are available here: https://github.com/suprimnakarmi/Fed-Listing.

2601.21187 2026-05-08 cs.CV cs.LG

FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen

Comments Accepted by ICML 2026

详情
英文摘要

Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that different SVD subspaces contribute differently to reasoning and perception, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities while largely preserving the model's visual capabilities by consistently achieving strong performance across diverse visual-language reasoning benchmarks.

2601.20375 2026-05-08 cs.LG cs.AI cs.CL

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei

Comments Accepted by VLDB2026

详情
英文摘要

Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; Cache-and-Reuse Mechanism}, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

2601.15395 2026-05-08 cs.CL cs.AI cs.HC

Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind

Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near

Comments Accepted to Findings of ACL 2026

详情
英文摘要

User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (like PersonaChat, PANDORA etc.) capture only trait, and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person(state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.

2601.14594 2026-05-08 cs.CV

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren, Dingcheng Shan, Jing-cheng Pang, Sijie Wu, Xubin Li, Kai Zhang, Xin Chen

详情
英文摘要

Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven events distribution. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify the gap between existing benchmark and human's cognition. Thus, we introduce ICH-CC built from carefully designed questions by annotators that reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that enhanced captions with LFS leads to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.