arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1857
2601.07663 2026-04-22 cs.AI cs.CL

Reasoning Models Will Sometimes Lie About Their Reasoning

William Walden, Miriam Wanner

详情
英文摘要

Hint-based faithfulness evaluations have established that Large Reasoning Models (LRMs) may not say what they think: they do not always volunteer information about how key parts of the input (e.g. answer hints) influence their reasoning. Yet, these evaluations also fail to specify what models should do when confronted with hints or other unusual prompt content -- even though versions of such instructions are standard security measures (e.g. for countering prompt injections). Here, we study faithfulness under this more realistic setting in which models are explicitly alerted to the possibility of unusual inputs. We find that such instructions can yield strong results on faithfulness metrics from prior work. However, results on new, more granular metrics proposed in this work paint a mixed picture: although models may acknowledge the presence of hints, they will often deny intending to use them -- even when permitted to use hints and even when it can be demonstrated that they are using them. Our results thus raise broader challenges for CoT monitoring and interpretability.

2601.07056 2026-04-22 cs.CV cs.AI

Adversarial Attacks on Medical Hyperspectral Imaging Exploiting Spectral-Spatial Dependencies and Multiscale Features

Yunrui Gu, Zhenzhe Gao, Cong Kong, Jiawei Du, Zhaoxia Yin

详情
英文摘要

Medical hyperspectral imaging (MHSI) has shown strong potential for disease diagnosis by capturing spectral-spatial information of tissues. While deep learning has substantially improved MHSI classification accuracy, its robustness remains limited due to the well-known trade-off between accuracy and robustness in Deep Neural Networks (DNNs). This issue is particularly critical in MHSI, where reliable prediction depends on local tissue relationships and multiscale spectral-spatial structures. A practical way to improve robustness is to identify the most unstable adversarial examples and incorporate them into adversarial training. However, existing attack methods do not sufficiently exploit these MHSI-specific properties, leading to suboptimal attack effectiveness and limited value for robustness enhancement. To address this gap, we propose a structured adversarial attack framework for MHSI that progressively models its local spectral-spatial dependencies and multiscale hierarchical representations. The proposed method generates anatomically consistent perturbations by modeling neighborhood dependencies and hierarchical spectral-spatial features. Experiments on the brain and choledoch datasets show that our method more effectively degrades lesion-related classification performance in critical tumor regions than existing baselines while maintaining low perturbation magnitude. These results reveal a clinically relevant robustness weakness in current MHSI models and provide stronger adversarial samples for developing targeted defense strategies.

2601.04925 2026-04-22 cs.CL

Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino

Comments Accepted to ACL 2026 Main Conference

详情
英文摘要

Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.

2601.04562 2026-04-22 cs.AI

Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation

Dongyi Lv, Qiuyu Ding, Heng-Da Xu, Zhaoxu Sun, Zhi Wang, Feng Xiong, Mu Xu

详情
英文摘要

Generative recommendation with large language models (LLMs) reframes prediction as sequence generation, yet existing LLM-based recommenders remain limited in leveraging geographic signals that are crucial in mobility and local-services scenarios. Here, we present Reasoning Over Space (ROS), a framework that utilizes geography as a vital decision variable within the reasoning process. ROS introduces a Hierarchical Spatial Semantic ID (SID) that discretizes coarse-to-fine locality and POI semantics into compositional tokens, and endows LLM with a three-stage Mobility Chain-of-Thought (CoT) paradigm that models user personality, constructs an intent-aligned candidate space, and performs locality informed pruning. We further align the model with real world geography via spatial-guided Reinforcement Learning (RL). Experiments on three widely used location-based social network (LBSN) datasets show that ROS achieves over 10% relative gains in hit rate over strongest LLM-based baselines and improves cross-city transfer, despite using a smaller backbone model.

2512.13684 2026-04-22 cs.CV

Recurrent Video Masked Autoencoders

Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman

详情
英文摘要

We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.

2511.21931 2026-04-22 cs.LG cs.AI

Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment

Henry Salgado, Meagan R. Kendall, Martine Ceberio

详情
Journal ref
International Workshop on Causality, Agents and Large Models, 2025
英文摘要

In this work, we propose a simple and computationally efficient framework for evaluating whether machine learning models align with the structure of the data they learn from; that is, whether the model says what the data says. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin's Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature's effect on the outcome. By comparing these data-derived feature rankings with model-based explanations, we provide practitioners with an interpretable and model-agnostic method for assessing model-data alignment.

2511.21893 2026-04-22 cs.LG

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar

详情
英文摘要

Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions [35], where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that purifies the attacker's perturbed input using generative models, e.g., Variational Autoencoders (VAEs), to restore natural alignment. To further enhance the defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on ImageBind, a state-of-the-art multi-modal encoder, show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment in unperturbed and perturbed input settings, providing an effective and task-agnostic defense against adversarial illusions. The code is available at https://github.com/fatemehakb/adversarial-illusions-mitigation.

2511.16164 2026-04-22 cs.LG stat.AP

Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France

Eloi Lindas, Yannig Goude, Philippe Ciais

详情
Journal ref
2026 Environ. Res.: Energy 3 025007
英文摘要

In a growing renewable based energy system, accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand and market risk management. Even though short-term weather forecasts have been thoroughly used to provide up to 3 days ahead renewable power predictions, forecasts involving prediction horizons longer than a week still need investigations. Despite the recent progress in subseasonal-to-seasonal weather probabilistic forecasting, their use for wind power prediction usually involves both temporal and spatial aggregation to achieve reasonable skill. In this study, we present a lead time and numerical weather model agnostic forecasting pipeline which enables to transform ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for France for lead times ranging from 1 day to 46 days at daily resolution. By leveraging a post-processing step of the resulting power ensembles we show that these forecasts improve the climatological baseline by 15% to 5% for the Continuous Ranked Probability Score and 20% to 5% for ensemble Mean Squared Error up to 16 days in advance, before converging towards the climatological skill. This improvement in skill is jointly obtained with near perfect calibration of the forecasts for every lead time. The results suggest that electricity market players could benefit from the extended forecast range up to two weeks to improve their decision making on renewable supply

2511.11793 2026-04-22 cs.CL

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Gen Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Wenhai Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu

Comments Technical Report

详情
英文摘要

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks-GAIA, HLE, BrowseComp, and BrowseComp-ZH-the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

2511.08418 2026-04-22 cs.LG

Physics-Informed Neural Operators for Cardiac Electrophysiology

Hannah Lydon, Milad Kazemi, Martin Bishop, Nicola Paoletti

Comments All code used in this work, including experimental results, can be found at https://github.com/janet-9/CardiacEP-PINOS This work was accepted for a poster presentation at the 2026 L4DC conference

详情
英文摘要

Accurately simulating systems governed by PDEs, such as voltage fields in cardiac electrophysiology (EP) modelling, remains a significant modelling challenge. Traditional numerical solvers are computationally expensive and sensitive to discretisation, while canonical deep learning methods are data-hungry and struggle with chaotic dynamics and long-term predictions. Physics-Informed Neural Networks (PINNs) mitigate some of these issues by incorporating physical constraints in the learning process, yet they remain limited by mesh resolution and long-term predictive stability. In this work, we propose a Physics-Informed Neural Operator (PINO) approach to solve PDE problems in cardiac EP. Unlike PINNs, PINO models learn mappings between function spaces, allowing them to generalise to multiple mesh resolutions and initial conditions. Our results show that PINO models can accurately reproduce cardiac EP dynamics over extended time horizons and across multiple propagation scenarios, including zero-shot evaluations on scenarios unseen during training. Additionally, our PINO models maintain high predictive quality in long roll-outs (where predictions are recursively fed back as inputs), and can scale their predictive resolution by up to 10x the training resolution. These advantages come with a significant reduction in simulation time compared to numerical PDE solvers, highlighting the potential of PINO-based approaches for efficient and scalable cardiac EP simulations.

2511.05540 2026-04-22 cs.RO cs.AI cs.CV cs.LG cs.NE

Constructing the Umwelt: Cognitive Planning through Belief-Intent Co-Evolution

Shiyao Sang

Comments 12 pages, 8 figures. A paradigm shift from reconstructing the world to understanding it: planning through Belief-Intent Co-Evolution

详情
英文摘要

This paper challenges a prevailing epistemological assumption in End-to-End Autonomous Driving: that high-performance planning necessitates high-fidelity world reconstruction. Inspired by cognitive science, we propose the Mental Bayesian Causal World Model (MBCWM) and instantiate it as the Tokenized Intent World Model (TIWM), a novel cognitive computing architecture. Its core philosophy posits that intelligence emerges not from pixel-level objective fidelity, but from the Cognitive Consistency between the agent's internal intentional world and physical reality. By synthesizing von Uexküll's $\textit{Umwelt}$ theory, the neural assembly hypothesis, and the triple causal model (integrating symbolic deduction, probabilistic induction, and force dynamics) into an end-to-end embodied planning system, we demonstrate the feasibility of this paradigm on the nuPlan benchmark. Experimental results in open-loop validation confirm that our Belief-Intent Co-Evolution mechanism effectively enhances planning performance. Crucially, in closed-loop simulations, the system exhibits emergent human-like cognitive behaviors, including map affordance understanding, free exploration, and self-recovery strategies. We identify Cognitive Consistency as the core learning mechanism: during long-term training, belief (state understanding) and intent (future prediction) spontaneously form a self-organizing equilibrium through implicit computational replay, achieving semantic alignment between internal representations and physical world affordances. TIWM offers a neuro-symbolic, cognition-first alternative to reconstruction-based planners, establishing a new direction: planning as active understanding, not passive reaction.

2511.04320 2026-04-22 cs.RO

MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments

Kuankuan Sima, Longbin Tang, Zhenyu Yang, Haozhe Ma, Lin Zhao

Comments Accepted by IEEE Robotics and Automation Letters

详情
英文摘要

Autonomous navigation in unknown environments requires multi-scale spatial understanding that captures geometric details, topological connectivity, and global structure to support high-level decision making under partial observability. Existing approaches struggle to efficiently capture such multi-scale spatial understanding while maintaining low computational cost for real-time navigation. We present MacroNav, a learning-based navigation framework featuring two key components: (1) a lightweight context encoder trained via multi-task self-supervised learning to capture multi-scale, navigation-centric spatial representations; and (2) a reinforcement learning policy that seamlessly integrates these representations with graph-based reasoning for efficient action selection. Extensive experiments demonstrate the context encoder's effective and robust environmental understanding. Real-world deployments further validate MacroNav's effectiveness, yielding significant gains over state-of-the-art navigation methods in both Success Rate (SR) and Success weighted by Path Length (SPL), with superior computational efficiency.

2510.20087 2026-04-22 cs.CV

Endoshare: A Publicly Available, Surgeons-Friendly Solution to De-Identify and Manage Surgical Videos

Lorenzo Arboit, Dennis N. Schneider, Britty Baby, Vinkle Srivastav, Pietro Mascagni, Nicolas Padoy

Comments 13 pages, 6 figures. Source-available software: https://camma-public.github.io/Endoshare/

详情
Journal ref
Surg Endosc, 2026
英文摘要

Video-based assessment and surgical data science can advance surgical training, research, and quality improvement, yet adoption remains limited by heterogeneous recording formats and privacy concerns linked to video sharing. This work develops, evaluates, and publicly releases Endoshare, a surgeon-friendly application that merges, standardizes, and de-identifies endoscopic videos. Development followed an iterative, user-centered software life cycle. In the analysis phase, an internal survey of four clinicians and four computer scientists, based on 10 usability heuristics, identified early requirements and guided a cross-platform, privacy-by-design architecture. Prototype testing reported high usability for clinicians (4.68 +/- 0.40 out of 5) and for computer scientists (4.03 +/- 0.51 out of 5), with the lowest score (4.00 +/- 0.93 out of 5) relating to label clarity, prompting interface refinement to streamline case selection, video merging, automated out-of-body removal, and filename pseudonymization. In the testing phase, ten surgeons completed an external survey combining the same heuristics with Technology Acceptance Model constructs, reporting high perceived usefulness (5.07 +/- 1.75 out of 7), ease of use (5.15 +/- 1.71 out of 7), heuristic usability (4.38 +/- 0.48 out of 5), and strong recommendation likelihood (9.20 +/- 0.79 out of 10). A performance assessment across different hardware and configurations showed that processing time increased proportionally with video duration and was consistently lower in fast mode. Endoshare is a publicly available solution to manage surgical videos, with potential to support training, research, and quality improvement. Compliance certification and broader interoperability validation are needed to establish it as a reliable tool for surgical video management. The software is available at https://camma-public.github.io/Endoshare

2510.16822 2026-04-22 cs.CV cs.AI

ReefNet: A Large-Scale Dataset and Benchmark for Fine-Grained Coral Reef Recognition

Abdulwahab Felemban, Yahia Battach, Faizan Farooq Khan, Yuqian Fu, Xuhui Liu, Yesmeen M. Khattab, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny

详情
英文摘要

Coral reefs are rapidly declining under anthropogenic pressures (e.g., climate change), creating an urgent need for scalable and automated monitoring. Progress in data-driven coral analysis, however, is constrained by the scarcity of large-scale datasets with fine-grained labels that are taxonomically consistent across sites and studies. To address this gap, we introduce ReefNet, a large-scale public coral reef image dataset with point-level annotations mapped to the World Register of Marine Species (WoRMS) taxonomy. ReefNet aggregates imagery from 76 curated CoralNet sources and an additional reef site from Al-Wajh (Red Sea), totaling approximately 925K genus-level hard coral annotations. Through expert-driven verification and targeted filtering, we derive a high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes, enabling reliable evaluation under realistic label noise and strong class imbalance. Beyond dataset construction, we establish a comprehensive benchmark spanning zero-shot, cross-domain few-shot adaptation, within-source evaluation, and cross-source transfer to the Al-Wajh dataset. Experiments with state-of-the-art vision-language models (VLMs), multimodal large language models (MLLMs), and vision-only backbones reveal substantial degradation in zero-shot and extremely few-shot regimes, while adaptation with in-domain supervision yields large gains yet still leaves a persistent gap under cross-source shift and on long-tail genera. These results highlight fundamental challenges in applying general-purpose multimodal models to biodiversity monitoring and underscore the importance of large-scale, taxonomically grounded, high-quality datasets. ReefNet serves as both a benchmark and a training resource for advancing fine-grained coral reef understanding.

2510.14630 2026-04-22 cs.CV

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer

Comments ICLR 2026, Code: https://github.com/CompVis/RepTok

详情
英文摘要

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.

2510.09204 2026-04-22 cs.RO cs.LG

Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization

Simon Idoko, Prajyot Jadhav, Arun Kumar Singh

详情
英文摘要

Centralized trajectory optimization in the joint space of multiple robots allows access to a larger feasible space that can result in smoother trajectories, especially while planning in tight spaces. Unfortunately, it is often computationally intractable beyond a very small swarm size. In this paper, we propose Flow-Opt, a learning-based approach towards improving the computational tractability of centralized multi-robot trajectory optimization. Specifically, we reduce the problem to first learning a generative model to sample different candidate trajectories and then using a learned Safety-Filter(SF) to ensure fast inference-time constraint satisfaction. We propose a flow-matching model with a diffusion transformer (DiT) augmented with permutation invariant robot position and map encoders as the generative model. We develop a custom solver for our SF and equip it with a neural network that predicts context-specific initialization. The initialization network is trained in a self-supervised manner, taking advantage of the differentiability of the SF solver. We advance the state-of-the-art in the following respects. First, we show that we can generate trajectories of tens of robots in cluttered environments in a few tens of milliseconds. This is several times faster than existing centralized optimization approaches. Moreover, our approach also generates smoother trajectories orders of magnitude faster than competing baselines based on diffusion models. Second, each component of our approach can be batched, allowing us to solve a few tens of problem instances in a fraction of a second. We believe this is a first such result; no existing approach provides such capabilities. Finally, our approach can generate a diverse set of trajectories between a given set of start and goal locations, which can capture different collision-avoidance behaviors.

2510.04800 2026-04-22 cs.CL

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Sangmin Bae, Bilge Acun, Chien-Yu Lin, Haroun Habeeb, Seungyeon Kim, Liang Luo, Junjie Wang, Carole-Jean Wu

Comments 41 pages, 8 figures, 22 tables;

详情
英文摘要

Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

2509.24803 2026-04-22 cs.LG cs.AI

TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan

Comments Accepted by the 14th International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.

2509.07966 2026-04-22 cs.CV cs.CL

Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Boammani Aser Lompo, Marc Haraoui

Comments Accepted at the First Workshop on Foundations of Reasoning in Language Models, NeurIPS 2025. Available at: https://openreview.net/forum?id=fvJRsGwhPf

详情
英文摘要

Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.

2508.14170 2026-04-22 cs.CL cs.CY

Comparing energy consumption and accuracy in text classification inference

Johannes Zschache, Tilman Hartwig

Comments Key results in Figure 2, accepted in Nature Sci Rep, 32 pages

详情
Journal ref
Sci Rep 16, 12717 (2026)
英文摘要

The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that in some contexts the best-performing model in terms of accuracy can also be energy-efficient. While LLMs tend to consume significantly more energy than traditional machine learning models, they show the same or even lower levels of accuracy in our zero-shot classification setting. We observe substantial variability in inference energy consumption ($<$mWh to $>$kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. Our findings demonstrate that energy efficiency and accuracy represent distinct evaluation dimensions that do not necessarily align. We argue that sustainable AI development requires systematic evaluation of both performance and resource efficiency.

2508.12121 2026-04-22 cs.LG math.DS

Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

Lorenzo Livi

Comments final version

详情
英文摘要

We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these predictions: across several sequence tasks, gates produce lag-dependent effective learning rates and concentrate gradient flow into low-dimensional subspaces, matching or exceeding the anisotropic structure induced by Adam. Notably, gating and optimizer-driven adaptivity shape complementary aspects of credit assignment: gates align state-space transport with loss-relevant directions, while optimizers rescale parameter-space updates. Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.

2508.04818 2026-04-22 cs.CV eess.IV stat.ML

Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models

Mehrdad Moradi, Marco Grasso, Bianca Maria Colosimo, Kamran Paynabar

Comments 9 pages, 8 figures, 1 table. Accepted to 2025 International Conference on Machine Learning and Applications (ICMLA)

详情
Journal ref
Proc. 2025 International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 2025, pp. 663-670
英文摘要

Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) Choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR

2508.03337 2026-04-22 cs.CV

Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang, Weiyu Guo, Ziyang Chen, Yijie Xu, Xuming Hu, Hui Xiong

Comments Accepted to CVPR 2026 Findings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026
英文摘要

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that synergistically combines Adaptive Frame-Pruning(AFP) with a lightweight text-based semantic graph. AFP intelligently prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides crucial, low-cost semantic compensation. Conducting extensive experiments on the LongVideoBench and Video-MME benchmarks against multiple state-of-the-art selectors, our approach demonstrates a drastic reduction in total input tokens by up to 82.2%. Crucially, by creating a concise, high-quality prompt, our framework not only enhances efficiency but also demonstrates a remarkable ability to robustify and improve the accuracy of upstream selectors, achieving results that are highly competitive with, and often superior to, baselines that use vastly more frames.

2508.02384 2026-04-22 cs.CV

SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis

Chen-Chen Fan, Peiyao Guo, Linping Zhang, Kehan Qi, Haolin Huang, Yong-Qiang Mao, Yuxi Suo, Zhizhuo Jiang, Yu Liu, You He

详情
英文摘要

Given the limitations of satellite orbits and imaging conditions, multi-modal remote sensing (RS) data is crucial in enabling long-term earth observation. However, maritime surveillance remains challenging due to the complexity of multi-scale targets and the dynamic environments. To bridge this critical gap, we propose a Synchronized Multi-modal Aligned Remote sensing Targets dataset for berthed ships analysis (SMART-Ship), containing spatiotemporal registered images with fine-grained annotation for maritime targets from five modalities: visible-light, synthetic aperture radar (SAR), panchromatic, multi-spectral, and near-infrared. Specifically, our dataset consists of 1092 multi-modal image sets, covering 38,838 ships. Each image set is acquired within one week and registered to ensure spatiotemporal consistency. Ship instances in each set are annotated with polygonal location information, fine-grained categories, instance-level identifiers, and change region masks, organized hierarchically to support diverse multi-modal RS tasks. Furthermore, we define standardized benchmarks on five fundamental tasks and comprehensively compare representative methods across the dataset. Thorough experiment evaluations validate that the proposed SMART-Ship dataset could support various multi-modal RS interpretation tasks and reveal the promising directions for further exploration.

2508.01959 2026-04-22 cs.CL

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu

Comments ACL 2025 Main Conference. Our trained models can be downloaded from: https://huggingface.co/SituatedEmbedding

详情
英文摘要

Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.

2507.09861 2026-04-22 cs.CV cs.AI

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, Yifan Peng

Comments Accepted at ACL 2026 Findings

详情
英文摘要

Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, including both OCR-based and OCR-free approaches for information extraction from document images. This survey reviews recent advances in MLLM-based VRDU, highlighting emerging trends and promising research directions with a focus on two key aspects: (1) techniques for representing and integrating textual, visual, and layout features; (2) training paradigms, including pretraining, instruction tuning, and training strategies. Moreover, we address challenges such as data scarcity, handling multi-page and multilingual documents, and integrating emerging trends such as Retrieval-Augmented Generation and agentic frameworks. Our analysis offers a roadmap for advancing MLLM-based VRDU toward more scalable, reliable, and adaptable systems.

2506.22174 2026-04-22 cs.RO cs.LG

ASVSim (AirSim for Surface Vehicles): A High-Fidelity Simulation Framework for Autonomous Surface Vehicle Research

Bavo Lesy, Siemen Herremans, Robin Kerstens, Jan Steckel, Walter Daems, Siegfried Mercelis, Ali Anwar

Comments 18 Pages, 13 Figures. Accepted at IEEE ACCESS

详情
英文摘要

The transport industry has recently shown significant interest in unmanned surface vehicles (USVs), specifically for port and inland waterway transport. These systems can improve operational efficiency and safety, which is especially relevant in the European Union, where initiatives such as the Green Deal are driving a shift towards increased use of inland waterways. At the same time, a shortage of qualified personnel is accelerating the adoption of autonomous solutions. However, there is a notable lack of open-source, high-fidelity simulation frameworks and datasets for developing and evaluating such solutions. To address these challenges, we introduce AirSim for Surface Vehicles (ASVSim), an open-source simulation framework specifically designed for autonomous shipping research in inland and port environments. The framework combines simulated vessel dynamics with marine sensor simulation capabilities, including radar and camera systems and supports the generation of synthetic datasets for training computer vision models and reinforcement learning (RL) agents. Built upon Cosys-AirSim, ASVSim provides a comprehensive platform for developing autonomous navigation algorithms and generating synthetic datasets. The simulator supports research of both traditional control methods and deep learning-based approaches. Through experiments in waterway segmentation and autonomous navigation, we demonstrate the capabilities of the simulator in these research areas. ASVSim is provided as an open-source project under the MIT license, making autonomous navigation research accessible to a larger part of the ocean engineering community. See https://github.com/BavoLesy/ASVSim.

2506.18871 2026-04-22 cs.CV cs.AI cs.CL

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu

详情
英文摘要

In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2

2506.09373 2026-04-22 cs.LG cs.AI cs.CV

LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen

Comments Accepted by ACL 2026 Findings

详情
英文摘要

The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, it further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at https://github.com/jqtangust/LPO.

2505.16662 2026-04-22 cs.RO eess.SP

Joint Magnetometer-IMU Calibration via Maximum A Posteriori Estimation

Chuan Huang, Gustaf Hendeby, Isaac Skog

Comments Accepted version

详情
英文摘要

This paper presents a new method for jointly calibrating a magnetometer and inertial measurement unit (IMU), focusing on balancing calibration accuracy and computational efficiency. The proposed method is based on a maximum a posteriori estimation framework, treating both the calibration parameters and orientation trajectory of the sensors as unknowns. This method enables efficient optimization of the calibration parameters using analytically derived derivatives. The performance of the proposed method is compared against that of two state-of-the-art methods. Simulation results demonstrate that the proposed method achieves the lowest root mean square error in calibration parameters, increasing the calibration accuracy by 20-30%, while maintaining competitive computational efficiency. Further validation through real-world experiments confirms the practical benefits of the proposed method. The proposed method calibrated 30 magnetometer-IMU pairs in under two minutes on a consumer-grade laptop, which is one order of magnitude faster than the most accurate state-of-the-art algorithm as implemented in this work. Moreover, when calibrated using the proposed method, a magnetic-field-aided inertial navigation system achieved positioning performance comparable to when it is calibrated with the state-of-the-art method. These results demonstrate that the proposed method is a reliable and effective choice for jointly calibrating magnetometer-IMU pairs.