arXivDaily: arXiv daily academic digest, updated Monday through Friday

2605.01591 2026-05-05 cs.IR cs.CL

Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

Amin Bigdeli, Amir Khosrojerdi, Radin Hamidi Rad, Morteza Zihayat, Charles L. A. Clarke, Ebrahim Bagheri

Neural Ranking Models (NRMs) are central to modern information retrieval but remain highly vulnerable to adversarial manipulation. Existing attacks often rely on heuristics or surrogate models, limiting effectiveness and transferability. We propose CRAFT, a supervised framework for black-box adversarial rank attacks powered by large language models (LLMs). CRAFT operates in three stages: adversarial dataset generation via retrieval-augmented generation and self-refinement, supervised fine-tuning on curated adversarial examples, and preference-guided optimization to align generations with rank-promotion objectives. Extensive experiments on the MS MARCO passage dataset, TREC Deep Learning 2019, and TREC Deep Learning 2020 benchmarks show that CRAFT significantly outperforms state-of-the-art baselines, achieving higher promotion rates and rank boosts while preserving fluency and semantic fidelity. Moreover, CRAFT transfers effectively across diverse ranking architectures, including cross-encoder, embedding-based, and LLM-based rankers, underscoring vulnerabilities in real-world retrieval systems. This work provides a principled framework for studying adversarial threats in NRMs, underscores the risks of generative AI in rank manipulation, and provides a foundation for developing more robust retrieval systems. To support reproducibility, we publicly release our source code, trained models, and prompt templates.

2605.01582 2026-05-05 cs.IR cs.AI

KG-First, LLM-Fallback: A Hybrid Microservice for Grounded Skill Search and Explanation

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

Authoritative competency frameworks such as ESCO, ROME, and O*NET are essential for aligning education with labor market needs, yet their technical complexity and structural heterogeneity hinder practical adoption by educators. This paper introduces SkillGraph-Service, an interoperable microservice designed to bridge this gap by unifying these resources into a provenance-preserving Knowledge Graph (KG). Adopting a KG-first, LLM-fallback architecture, the system combines symbolic rigor with sub-symbolic flexibility. It implements a lightweight hybrid retrieval engine (fusing SQLite FTS5 and HNSW vector search) to handle the vocabulary mismatch in educator queries, and utilizes Large Language Models (LLMs) strictly for constrained ranking and audience-aware explanation. Empirical evaluation on a multilingual dataset reveals that the proposed hybrid strategy achieves superior retrieval effectiveness (nDCG@5 > 0.94) with sub-200 ms latency, suggesting that computationally expensive cross-encoder re-ranking may be unnecessary for this domain. Furthermore, an analysis of generated explanations highlights a trade-off between fluency and faithfulness: while JSON-constrained LLMs ensure high citation precision, deterministic templates remain the most reliable method for maximizing evidence coverage. The resulting architecture offers a practical, scalable, and auditable solution for integrating complex skill data into digital learning ecosystems.
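
The abstract says the retrieval engine fuses a full-text index with a vector index but does not say how the two result lists are combined. As a rough illustration only, the sketch below uses reciprocal rank fusion, one common way to merge ranked lists; the function name, the skill IDs, and the constant k=60 are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists: one from a lexical (full-text) index,
# one from a vector (nearest-neighbour) index.
lexical_hits = ["skill:python", "skill:sql", "skill:statistics"]
vector_hits = ["skill:data-analysis", "skill:python", "skill:statistics"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)
```

Items found by both indexes rise to the top, which is one way a hybrid engine can soften vocabulary mismatch: a skill that the lexical index misses can still surface through the vector list.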

2605.01579 2026-05-05 stat.ME cs.LG

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

Hoang Dang, Luan Pham, Minh Nguyen

Comments: 36 pages, 2 figures

Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices but, to the best of our knowledge, do not answer the question: how many analyst decisions must change to reach a specification (a set of choices) whose confidence interval (CI) contains zero? We introduce Minimum Specification Perturbation (MSP), the smallest such number of changes. MSP is small under the null, grows with effect strength, and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that the Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1, meaning that a single decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.
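
To make the MSP definition concrete, the sketch below computes it by exhaustive search on a tiny grid of specifications. Everything here is a toy assumption: the three decisions, the `toy_ci` function, and its numbers are invented for illustration, and brute force is only viable for small grids (the abstract notes the problem is NP-hard in general).

```python
from itertools import product


def hamming(a, b):
    """Number of decisions on which two specifications differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_specification_perturbation(baseline, options, ci_of):
    """Smallest number of decision changes from `baseline` to any
    specification whose confidence interval contains zero.

    options: one list of allowed choices per decision.
    ci_of:   callable mapping a specification to (lower, upper).
    Returns None if no specification in the grid has a CI containing zero.
    """
    best = None
    for spec in product(*options):
        lower, upper = ci_of(spec)
        if lower <= 0.0 <= upper:
            d = hamming(baseline, spec)
            if best is None or d < best:
                best = d
    return best


def toy_ci(spec):
    """Hypothetical CI: each alternative choice shrinks the estimate."""
    covariates, estimator, trimmed = spec
    estimate = 2.0
    if covariates == "full":
        estimate -= 0.7
    if estimator == "ipw":
        estimate -= 0.6
    if trimmed:
        estimate -= 0.2
    return estimate - 1.0, estimate + 1.0


options = [["basic", "full"], ["ols", "ipw"], [False, True]]
baseline = ("basic", "ols", False)
msp = minimum_specification_perturbation(baseline, options, toy_ci)
print(msp)
```

In this toy grid no single change pulls the interval down to zero, but switching both the covariate set and the estimator does, so the search reports a distance of two decisions.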

2605.01567 2026-05-05 cs.SE cs.CL cs.LG

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

Mehmet Iscan

Comments: 25 pages, 5 figures, 7 tables. Preprint. Implementation and supplementary artifacts are available at the project repository

Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic retrieval-augmented generation (RAG) are insufficient for reinforcement-learning (RL) code development, where small details can alter Bellman targets, terminal masks, gradient flow, or validation claims. This paper presents RL Developer Memory, a local-first, Model Context Protocol (MCP)-native developer-memory architecture for RL coding agents. It treats memory selection as a logged contextual decision process: issue_match ranks candidates and records telemetry, issue_feedback maps raw labels to bounded rewards, and issue_record_resolution links verified resolutions to earlier retrieval events. A deterministic ranker remains deployed, while a contextual-bandit residual policy runs in shadow mode and can affect canary behavior only through conservative off-policy-evaluation (OPE) gates. RL/control memories require theory-to-code metadata and review-gated governance. The system is evaluated on a deterministic 200-case benchmark with RL algorithm bugs, hard negatives, review-gated RL/control cases, and low-risk failures. In the same-commit comparison, deterministic control and full shadow/OPE both achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration adds learning telemetry rather than accuracy gain. Static validation passed 11/11 checks; dynamic integration passed 10/10 cases. The evidence reports limits: active learned-policy deployment and official-client MCP interoperability are unsupported, live full-configuration latency regresses, and 40 residual non-RL failures remain. The contribution is an auditable memory-control architecture with explicit claim boundaries, not a universal coding-agent improvement claim.

2605.01562 2026-05-05 cs.SE cs.AI

Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse

Ahmed Ibrahim

The Object-Oriented Method for Requirements Authoring and Management (OOMRAM) is a requirements reuse framework that relies on exact identifier matching and rigid templates, limiting its ability to adapt specifications across diverse contexts. While Large Language Models (LLMs) offer the flexibility to overcome this bottleneck, they introduce the risk of generating structurally invalid or inconsistent requirement combinations. To address this tension, we present a neuro-symbolic multi-agent system that re-conceptualizes requirements reuse as a Model-Driven Elicitation process. In this paradigm, an LLM serves as a non-deterministic heuristic for traversing a deterministic domain model represented by a formal OOMRAM requirement lattice. A deterministic, symbolic validator enforces all structural constraints within the agent loop, effectively eliminating hallucinated requirement combinations by construction. Evaluated on an autonomous benchmark across two application families, our system achieves 100% requirement coverage and a constraint-violation rate of only 0.2%. Although the F1-score against a single gold standard is moderate (0.47 to 0.51), every generated specification is structurally valid and satisfies all mandatory domain requirements. The model-agnostic implementation scales to larger lattices via subgraph navigation and provides transparent audit trails for regulatory compliance.

2605.01561 2026-05-05 econ.EM cs.LG physics.soc-ph

Hall-Like Transversal Stress and Sandpile Criticality on Real Production Networks

Diego Vallarino

This paper develops a Hall-Sandpile model of economic instability that combines a Hall-like transversal stress mechanism with sandpile threshold dynamics on a real production-network substrate. In analogy with the physical Hall effect, where exposed flows under an external field generate stress in a transversal direction, we model economic shocks as fields that act on flow-intensive, low-redundancy, low-capacity nodes and produce systemic stress through a multiplicative conversion function. The accumulated stress drives a discrete toppling rule and avalanche dynamics whose effective activation threshold declines with transversal exposure. The model is calibrated on annual World Input-Output Database (WIOD) production networks for 2000 to 2014 and simulated on the 2014 substrate (2,283 country-sector nodes) under three alternative propagation normalisations to avoid mechanical near-criticality from row-stochastic operators. Controlled Monte Carlo experiments over external field intensity and redundancy stress generate four ordered regimes: stable absorption, latent fragility, critical transition, and avalanche regime. Mean avalanche size and the probabilities of finite-size systemic events $\Pr(S \geq 5)$, $\Pr(S \geq 10)$ and $\Pr(S \geq 20)$ rise jointly with field intensity and redundancy stress. Tail diagnostics show regime-dependent thickening of the avalanche distribution, but the estimated tail indices remain too high to interpret as evidence of universal power-law criticality. The contribution is therefore a finite-size, real-network description of how transversal stress activates structural fragility, not a claim of self-organised criticality in the global economy.
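
The toppling mechanism can be pictured with a generic threshold sandpile on a small directed network. The sketch below is not the paper's model: the three-node chain, the sub-stochastic weights, and the threshold form base / (1 + alpha * exposure) are all assumptions chosen only to show how a lower effective threshold under higher exposure can turn an absorbed shock into an avalanche.

```python
from collections import deque


def effective_threshold(base, exposure, alpha=0.5):
    """Hypothetical rule: the threshold declines as exposure grows."""
    return base / (1.0 + alpha * exposure)


def run_avalanche(stress, thresholds, weights, seed_node, shock):
    """Add `shock` to one node and relax the network.

    stress:     list of current stress per node (modified in place).
    thresholds: list of positive activation thresholds per node.
    weights:    weights[i][j] is the share of node i's stress passed to
                node j when i topples; each row sums to less than 1, so
                some stress dissipates and the relaxation terminates.
    Returns the avalanche size, counted as the number of distinct nodes
    that toppled.
    """
    stress[seed_node] += shock
    queue = deque([seed_node])
    toppled = set()
    while queue:
        i = queue.popleft()
        if stress[i] < thresholds[i]:
            continue  # node absorbs its current stress
        released = stress[i]
        stress[i] = 0.0
        toppled.add(i)
        for j, w in enumerate(weights[i]):
            if w > 0.0:
                stress[j] += w * released
                queue.append(j)
    return len(toppled)


# Toy chain 0 -> 1 -> 2 with 20% of released stress dissipated per step.
weights = [[0.0, 0.8, 0.0], [0.0, 0.0, 0.8], [0.0, 0.0, 0.0]]

calm = run_avalanche(
    [0.5, 0.5, 0.5], [effective_threshold(1.0, 0.0)] * 3, weights, 0, 0.4
)
stressed = run_avalanche(
    [0.5, 0.5, 0.5], [effective_threshold(1.0, 1.0)] * 3, weights, 0, 0.4
)
print(calm, stressed)
```

With zero exposure the same shock is absorbed at the seed node; with exposure set to 1 the lowered thresholds let it cascade through all three nodes.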

2605.01546 2026-05-05 cs.NI cs.AI

6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Sixth-generation (6G) networks are increasingly envisioned as AI-native infrastructures integrating communication, sensing, and computing into a unified fabric. However, existing approaches remain largely optimization-centric, relying on closed-loop control with limited reasoning capability. In this paper, we argue for a paradigm shift toward Agentic AI-Native 6G, in which Large Language Model (LLM)-based agents operate as bounded, policy-governed reasoning entities within a semantic control plane layered above deterministic 3GPP infrastructure. We propose a four-layer architecture that integrates deterministic network infrastructure, semantic abstraction of intent and context, hierarchical reasoning, and a distributed multi-agent fabric spanning device, edge, and core domains. To assess feasibility, we develop a proof-of-concept agentic reasoning and orchestration framework and conduct an extensive empirical study using a domain-specific 6G benchmark under realistic deployment constraints. Our results reveal a fundamental tradeoff between reasoning capability and system efficiency, showing that no single model simultaneously satisfies latency, throughput, and accuracy requirements. Instead, heterogeneous deployment of LLM agents across the device-edge-core continuum is necessary to balance these constraints. We further demonstrate that quantization introduces non-uniform effects across models, reinforcing the need for system-level optimization rather than model-level compression alone. These findings establish agentic intelligence as a viable architectural direction for 6G and highlight key challenges in achieving scalable, trustworthy, and self-reasoning networks. All experimental results and evaluation scripts are publicly available to support reproducibility.

2605.01492 2026-05-05 stat.ML cs.IT cs.LG math.IT

Stabilizing Private LASSO under Heterogeneous Covariates via Anisotropic Objective Perturbation

Haruka Tanzawa, Ayaka Sakata

Comments: 6 pages, 5 figures

We study high-dimensional LASSO under differential privacy via objective perturbation with heterogeneous covariate scales. In practical scenarios, covariates often exhibit diverse scales; however, standard preprocessing is problematic under privacy constraints, as it consumes additional privacy budget. This heterogeneity induces effective anisotropy in the objective perturbation via the inverse Gram matrix of covariates, which can degrade the stability and accuracy of algorithms. To address this, we propose a Gram-based anisotropic objective perturbation, a "pre-distortion" strategy that counteracts the distortion from the covariate structure to restore isotropy in the estimation process. Using an Approximate Message Passing (AMP) framework and state evolution analysis, we demonstrate that our proposed perturbation significantly stabilizes convergence and improves both statistical efficiency and privacy performance compared to standard uniform noise injection. Our results provide theoretical insights into designing stable and efficient private estimators without relying on data-dependent preprocessing.

2605.01471 2026-05-05 cs.SE cs.AI

Practical Limits of Autonomous Test Repair: A Multi-Agent Case Study with LLM-Driven Discovery and Self-Correction

Hyukjoo Lee

Comments: Industrial case study; submitted for review

Maintaining reliable UI test suites in large-scale enterprise applications is a persistent and costly challenge. We present an industrial case study of a multi-agent autonomous testing system evaluated using anonymized execution data from a production-like enterprise UI testing prototype. The application features several hundred dynamic UI elements per screen. Built on a large language model with LangGraph orchestration, Playwright execution, and a RAG knowledge base, the system evolves from human-directed testing toward high-autonomy feature discovery and test execution: given no explicit test targets, it discovers over 100 testable features across 10 UI screens, dynamically expands coverage by an additional 15 to 30 features through runtime DOM analysis, and iteratively repairs failing tests without human intervention. We analyzed 300 consecutive autonomous execution reports encompassing 636 individual test-case executions across 10 distinct scenario families. The system achieved a 70% repair convergence rate at the scenario-family level, with a mean of 3.4 repair iterations to convergence. However, only 10% of scenario families succeeded on the first attempt, 38% of reports failed to produce any executable test artifact, and we documented concrete instances of assertion weakening and test-case deletion used as workaround mechanisms to achieve superficial convergence. Our findings show that unrestricted autonomy leads to unstable and often misleading outcomes, while constrained autonomy transforms such systems into operationally viable workflows. Rather than advocating full autonomy, our findings suggest that reliable autonomous testing in enterprise-scale settings requires explicit constraints, validation boundaries, and human oversight to preserve semantic correctness and operational trustworthiness.

2605.01467 2026-05-05 math.OC cs.CV cs.NA math.NA

Quaternion Nonlinear Transform-Induced Nuclear Norm for Low-Rank Tensor Completion

Biswarup Karmakar, Ratikanta Behera

Comments: 25 pages

Tensor completion has emerged as a powerful framework for recovering missing data in multidimensional signals by exploiting low-rank tensor structures. Among existing approaches, linear transform-based tensor nuclear norm (TNN) methods have achieved considerable success by enforcing low-rankness on transformed frontal slices. However, the low-rank structure revealed by linear transforms remains inherently limited. To better capture intrinsic correlations, nonlinear transform-based TNN (NTTNN) models have been proposed, significantly enhancing low-rank representation through composite transforms. Despite their effectiveness, existing NTTNN methods are restricted to real-valued tensors and fail to model quaternion-valued data, which are essential for preserving inter-channel dependencies in color images and videos. Extending nonlinear TNN models to the quaternion domain is challenging due to the non-commutativity of quaternion multiplication and the complexity of quaternion singular value decomposition. To address the limitations encountered in prior works, we propose a quaternion nonlinear transform-induced tensor nuclear norm (QNTTNN) via a real embedding of quaternions, enabling tractable nuclear norm definitions and efficient optimization. Building upon QNTTNN, we formulate a quaternion tensor completion model and develop a proximal alternating minimization algorithm with rigorous convergence guarantees. Extensive experiments on benchmark color video inpainting datasets validate the superior performance of the proposed method over existing approaches.

2605.01452 2026-05-05 stat.ME cs.LG

Stable Localized Conformal Prediction via Transduction

Yinjie Min, Liuhua Peng, Changliang Zou

Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, prediction sets often exhibit high variability in size, especially for methods with localization. We formalize this concern as set stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Conformal Prediction (StCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of StCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods, especially for those with localization, when calibration data are limited.
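
The stability notion can be illustrated with plain split conformal prediction, which is not the proposed StCP method. In the sketch below, which rests on several simplifying assumptions (standard normal residuals, an absolute-residual score, no localization), every test point gets an interval of width 2q, so the conditional expected set size given the calibration data is simply 2q, and its variance across repeated calibration draws is a Monte Carlo estimate of the stability quantity described above.

```python
import math
import random
import statistics


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def set_size_variance(n_cal, alpha=0.1, n_repeats=300, seed=0):
    """Variance of the interval width 2q over repeated calibration sets.

    Residuals are simulated as |N(0, 1)|; this toy setup is an assumption
    for illustration, not the paper's experimental design.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_repeats):
        residuals = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_cal)]
        q = conformal_quantile(residuals, alpha)
        sizes.append(2.0 * q)
    return statistics.pvariance(sizes)


small_n = set_size_variance(n_cal=20)
large_n = set_size_variance(n_cal=500)
print(small_n, large_n)
```

The variance with 20 calibration points comes out far larger than with 500, which is the limited-calibration instability the abstract describes.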

2605.01449 2026-05-05 cs.CR cs.AI

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

Pang Liu, Yingjie Lao

Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model's output was perturbed (Influence), and (ii) the attacker's chosen target concept was actually emitted (Precise Injection). We compose two existing techniques, Universal Adversarial Attack and AnyAttack, under an $L_\infty$ budget of 16/255, and we add a dual-axis evaluation: a deterministic Ratcliff-Obershelp drift score for Influence (programmatic baseline) plus a 4-tier ordinal scale (none/weak/partial/confirmed) for Precise Injection. The judge is DeepSeek-V4-Pro in thinking mode, calibrated against Claude Opus 4.7 with Cohen's $\kappa$ = 0.77 on the injection axis (substantial agreement); the entire 4475-entry SHA-256 input cache ships with the dataset so reviewers can re-derive paper numbers bit-exact without an API key. Across 6615 pairs over four open VLMs, seven attack prompts, and seven test images, the two axes diverge by roughly $90\times$: 66.4% of pairs are programmatically disturbed (LLM-judged 46.6% at the substantial-or-complete tier), but only 0.756% (50/6615) reach any non-none injection tier and only 0.030% (2/6615) are verbatim. The few injections that do land cluster on screenshot- or document-style carriers whose semantics already invite text transcription. BLIP-2 shows zero detectable drift at $L_\infty$ = 16/255 across all 2205 pairs even when used as a Stage-1 surrogate. We release the full dataset (21 universal images, 147 adversarial photos, 6,615 response pairs, the v3 dual-axis judge results, and the cache) at huggingface.co/datasets/jeffliulab/visinject.
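
A Ratcliff-Obershelp style drift score can be approximated with Python's standard library: `difflib.SequenceMatcher` implements a closely related gestalt pattern matching algorithm. The sketch below is an assumption about what such a score might look like, not the paper's implementation; the function name, the "1 minus similarity" convention, and the sample strings are illustrative only.

```python
from difflib import SequenceMatcher


def drift_score(clean_response, attacked_response):
    """Return 1 - similarity: 0.0 for identical text, 1.0 when the two
    strings share no matching characters."""
    matcher = SequenceMatcher(
        None, clean_response, attacked_response, autojunk=False
    )
    return 1.0 - matcher.ratio()


# Hypothetical model outputs for a clean and a perturbed image.
clean = "a cat on a mat"
perturbed = "a dog on a mat"

print(drift_score(clean, clean))      # identical text
print(drift_score(clean, perturbed))  # partial drift
print(drift_score("abc", "xyz"))      # nothing in common

# The reported percentages follow directly from the stated counts.
any_injection_pct = round(100 * 50 / 6615, 3)
verbatim_pct = round(100 * 2 / 6615, 3)
print(any_injection_pct, verbatim_pct)
```

A score like this only measures how much the output text changed, which is why the abstract treats it as the Influence axis and scores target emission separately.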

2605.01423 2026-05-05 hep-ex cs.AI cs.MA

HepScript: A Dual-Use DSL for Human-AI Collaborative Data Analysis Workflows in High-Energy Physics

Junkun Jiao, Tong Liu, Ke Li, Weimin Song, Yipu Liao, Bolun Zhang, Beijiang Liu, Chang-Zheng Yuan, Yue Sun

The escalating data scale in High-Energy Physics (HEP) fuels a growing aspiration for higher analytical efficiency. While Large Language Models (LLMs) offer a path toward automation via agentic AI, they struggle with complex scientific workflows that require deep domain knowledge and are tightly coupled to experiment-specific codebases. To address this, we introduce a methodology centered on HepScript, a dual-use Domain-Specific Language (DSL) for HEP data analysis workflows. HepScript serves as a shared formal interface, abstracting HEP analysis logic into a constrained syntax that is both intuitive for human experts and reliably generable by AI agents. First developed for the Beijing Spectrometer III (BESIII) experiment, HepScript hides the complexity of the underlying software stack, translating high-level analysis intent into low-level, production-ready code. In our case studies, this abstraction reduces the required human-written code by 93%. Crucially, HepScript's constrained grammar defines a tractable action space, enabling AI agents to autonomously generate executable specifications for core analysis stages directly from published literature with a 95% success rate. Our work demonstrates a scalable pathway toward human-AI collaborative systems, where a formally specified DSL acts as an unambiguous translation layer between human expertise, AI automation, and the production environment, rendering previously intractable automation problems solvable.

2605.01416 2026-05-05 cs.CY cs.CL

Who Decides What Is Harmful? Content Moderation Policy Through A Multi-Agent Personalised Inference Framework

Ewelina Gajewska, Michal Wawer, Katarzyna Budzynska, Jaroslaw A. Chudziak

Comments: The paper has been accepted to the 34th European Conference on Information Systems (ECIS 2026). The official paper version will appear in the conference proceedings

The increasing scale and complexity of online platforms raise critical policy questions around harmful content, digital well-being, and user autonomy. Traditional content moderation systems rely on centralised, top-down rules, often failing to accommodate the subjective nature of harm perception. This paper proposes an LLM-based multi-agent personalised inference framework that filters content based on unique sensitivity profiles of individual users. Our architecture combines domain-specific Expert Agents, a Manager Agent for orchestrating content analysis and agent selection, and a Ghost Profile Agent for simulating user perspectives, to inform moderation decisions. Evaluated against a range of non-personalised baselines, the system demonstrates up to a 32% improvement in accuracy, showing increased alignment with individual user sensitivities. Beyond technical performance, our framework provides policy-relevant insights for platform governance, offering a scalable way to reconcile moderation policies with societal and individual digital rights.

2605.01407 2026-05-05 cs.IR cs.CL

The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles

Hiun Kim, Tae Kwan Lee, Taeryun Won

Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and exhibit transfer-learning issues when fine-tuned into Neural Bi-Encoder models. This paper studies the effect of different pre-training datasets and pre-training options on MLM pre-trained models for retrieval fine-tuning. The study focuses on SPLADE-style models, which also use the MLM layer at fine-tuning time. More specifically, we experiment with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, using in-house web document titles as datasets. We conduct pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors. Our observations are three-fold. First, the fine-tuned models with higher retrieval effectiveness in both the unpruned and the strictest pruned settings are mostly pre-trained on a general corpus with a higher learning rate, and show lower MLM accuracies. Second, in the strictest pruned setting, those models show a higher retrieval cost and a higher variance in the length of individual postings lists. Third, repeating the general pre-training dataset has little effect on retrieval effectiveness. The experiments empirically identify potential limitations in aligning MLM pre-training with ESPLADE fine-tuning. They also show that, in the strictest pruned setting, retrieval effectiveness is better maintained at a higher retrieval cost, indicating a trade-off between the two in our setting.
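
The abstract does not say how the sparse vectors are pruned at test time. One simple possibility, shown here purely as an assumed illustration, is to keep only the k highest-weight terms of each vector; the function name and the example term weights are hypothetical.

```python
def prune_top_k(sparse_vector, k):
    """Keep the k highest-weight terms of a sparse term-weight mapping.

    Ties are broken alphabetically by term so the result is deterministic.
    """
    if k <= 0:
        return {}
    ranked = sorted(sparse_vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


# Hypothetical expanded representation of a short document title.
title_vector = {"laptop": 2.1, "notebook": 1.4, "computer": 0.9, "cheap": 0.3}

pruned = prune_top_k(title_vector, k=2)
print(pruned)
```

Stricter pruning (smaller k) means fewer postings lists are touched per query or document, which lowers retrieval cost but can discard expansion terms that would have matched, in line with the cost and effectiveness trade-off reported above.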

2605.01404 2026-05-05 cs.AR cs.AI

AMSnet-q: Unsupervised Circuit Identification and Performance Labeling for AMS Circuits

Ze Zhang, Junzhuo Zhou, Yichen Shi, Zhuofu Tao, Rui Ji, Zhiping Yu, Quan Chen, Ting-Jung Lin, Lei He

Analog and mixed-signal (AMS) circuit design remains heavily reliant on expert knowledge. While recent AI-driven automation tools can generate candidate topologies, they critically depend on manually curated datasets with functional and performance annotations -- a requirement that current large language models (LLMs) and vision models cannot automate. Existing approaches still require domain experts to manually interpret circuit functionality. We present AMSnet-q, a fully automated, unsupervised pipeline that eliminates human-in-the-loop annotation by converting schematic images directly into a labeled AMS circuit database. Unlike prior work that stops at netlist extraction, our framework automates the complete verification loop: it performs schematic-to-netlist conversion, topology-aware testbench generation, and simulation-based sizing validation to objectively determine circuit functionality. Validated in 28 nm technology, AMSnet-q processed 739 schematics from the AMSnet 1.0 dataset, automatically constructing a repository of 4 circuit classes, 105 distinct topologies, and 89,789 labeled device configurations. By decoupling human effort from dataset volume and reducing the workload to a one-time testbench template per circuit class, AMSnet-q enables scalable, objective, and fully automated AMS database construction.

2605.01400 2026-05-05 cs.HC cs.AI cs.CY cs.IR

Investigating the Effects of Different Levels of User Control in an Interactive Educational Recommender System

Qurat Ul Ain, Mohamed Amine Chatti, William Kana Tsoplefack, Rawaa Alatrash, Shoeb Joarder

Comments: Submitted to TORS. arXiv admin note: text overlap with arXiv:2501.12894

Educational recommender systems (ERSs) are becoming increasingly important in enhancing educational outcomes and personalizing learning experiences by providing recommendations of personalized resources and activities to learners, tailored to their individual learning needs. While user control is widely assumed to improve user experience, the effects of different levels of control in ERSs remain underexplored. To address this gap, we designed and evaluated an interactive ERS within the MOOC platform CourseMapper, where learners could interact with the input (i.e., user profile), process (i.e., recommendation algorithm), and output (i.e., recommendations) of the system. We conducted a between-subjects user study (N=184) to examine how varying levels of user control in an ERS influenced users' perceptions of the recommendation goals of perceived control, transparency, trust, satisfaction, and perceived quality. Our results show that enabling users to build and refine their profile is sufficient to promote positive perceptions of the ERS, while additional control options mainly reinforce these impressions. Moreover, perceived control is the only goal significantly affected by providing different levels of user control in the ERS, with input control exerting the strongest influence. Furthermore, different levels of control affect transparency, trust, satisfaction, and perceived quality in distinct yet interconnected ways. Overall, the findings provide empirical evidence that user control positively shapes transparency, trust, satisfaction, and perceived quality, though to varying extents.

2605.01394 2026-05-05 cs.SE cs.AI

LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

Dong Xu, Jialun Cao, Guozhao Mo, Junjie Hu, Cheng Wen, Hongyu Lin, Xianpei Han, Shengchao Qin, Cong Tian, Shing-Chi Cheung, Le Sun, Yaojie Lu

Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, and the agentic pipeline, and we perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates performance because models under direct prompting may exhibit unfaithful behaviors, such as deceiving automated provers or ignoring code-context constraints; after excluding such cases, the true specification generation accuracy drops by approximately 20%. We further find that both increased sampling and thinking mode significantly improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines are particularly effective under low sampling budgets and on harder datasets. Failure analysis further shows that incorrect loop invariants are the dominant error type, while agentic pipelines notably reduce assertion errors. These results expose fundamental limitations in current LLM-based approaches and suggest they remain far from replacing human-authored formal specifications. We release LiveFMBench at https://huggingface.co/datasets/fm-universe/Live-FM-Bench and all evaluation artifacts to support future research.

2605.01392 2026-05-05 cs.SE cs.AI

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, Mojtaba Shahin, Arif Ali Khan, Qiong Feng

Comments: 29 pages, 8 images, 6 tables. Manuscript submitted to a journal (2026)

Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgment. However, there has been little research focusing on how LLMs are used in software design, nor on the associated benefits and drawbacks. This paper aims to bridge this gap by empirically investigating how software developers utilize LLMs in the context of software design. We conduct a mixed-methods study, combining a mining study of 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 software practitioners. Our findings reveal nine distinct categories of design tasks supported by ChatGPT, including architecture design, data model design, and the use of design patterns. We further characterize developer-ChatGPT interactions, showing that developers primarily use ChatGPT for knowledge acquisition and design-related code generation, with most tasks situated at the detailed design level. The study identifies seven key benefits of utilizing LLMs in software design as perceived by developers, such as better technology selection and the early detection of design flaws. We also uncover six limitations, including the generation of overly lengthy and difficult-to-read outputs, the creation of inexecutable or incorrect code, and a heavy reliance on context that can lead to hallucinated results. These findings provide an evidence-based characterization of current LLM use in software design from both open-source and practitioner perspectives, highlighting a tension between perceived benefits and limitations, which lays a foundation for future research and the development of effective techniques and tools to integrate LLMs into software design practices.

2605.01367 2026-05-05 quant-ph cs.LG cs.SY eess.SY

From Characterization To Construction: Generative Quantum Circuit Synthesis from Gate Set Tomography Data

King Yiu Yu, Aritra Sarkar, Erbing Hua, Maximilian Rimbach-Russ, Ryoichi Ishihara, Sebastian Feld

Comments 19 pages, 3 figures

Abstract

High-fidelity circuit execution on noisy intermediate-scale quantum devices is bottlenecked by compilation pipelines that disregard complex, correlated noise. To address this, this methodology article proposes a quantum machine learning control (QMLC) framework for generative quantum circuit synthesis from gate-set tomography (GST) data that bypasses the traditional two-step pipeline of characterizing native quantum gates via GST followed by unitary decomposition algorithms. Instead, a generative concept space is directly learnt from GST data, enabling synthesis of quantum circuits conditioned on a desired output distribution. Our approach tokenizes GST germ circuits and embeds them into a structured latent space using a curriculum-learning-motivated strategy, starting with short circuits and progressively incorporating longer ones with diverse output statistics. The embedded sequences are processed by a set-vision transformer with permutation-invariant pooling, producing k-seed vectors that represent the learned concept space of the quantum device. Aggregating data across multiple circuits makes this latent representation inherently context-aware, capturing the shared physical noise environment (e.g., crosstalk, drift) that isolated gate metrics miss. We propose an unconditional diffusion model to sample from the concept space. During inference, a user provides a target measurement distribution, and the model generates a corresponding circuit. To ensure fidelity and robustness, the output is denoised using a diffusion model that operates on the target conditional covariance matrix. This end-to-end framework is a step towards context-aware, hardware-native circuit synthesis directly from raw GST data, which offers a new paradigm for integrating quantum control and compilation. The QMLC framework is particularly suited for near-term quantum devices with complex calibration procedures.

2605.01363 2026-05-05 hep-ex cs.LG hep-ph stat.ME

Data-Driven, Geometry-Aware Optimal-Transport Calibration of Flavor Tagger

Yeonjoon Kim, Un-ki Yang

Comments 32 Pages, 12 Figures

Abstract

Flavor-tagging calibrations are often provided either as scale factors measured at a finite set of working points or as binned corrections to a chosen one-dimensional discriminant. However, this approach falls short of providing continuous, event-level calibration across the full multicomponent outputs of modern taggers. This limitation leads to information loss in analyses that demand high-performance flavor tagging, restricting analyses to a limited set of predefined variables. In this work, we propose a geometry-aware framework that formulates flavor-tagger calibration as an optimal transport problem on the probability simplex. The transport maps are parameterized and trained in the isometric log-ratio coordinate system. Because the quadratic Euclidean cost of Brenier transport in this coordinate system is equivalent to the Aitchison distance on the simplex, the learned map induces a minimal deformation under the Aitchison geometry. Furthermore, we extract flavor-conditional target distributions directly from control-region data using an expectation-maximization (EM) technique that simultaneously fits multiple control regions, models each flavor component with a normalizing flow, and estimates the regional mixture fractions. The extracted targets are subsequently used to learn flavor-factorized transport maps. Because the joint estimation of mixture fractions and flexible component densities admits weakly constrained directions, we further introduce a linearized feedback-operator analysis that propagates the fitted composition covariance into the extracted component densities, separating data-constrained modes from those dominated by the composition prior. The simulation-based closure study demonstrates improved closure in dedicated control regions and in independent validation mixtures.
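
The equivalence between Euclidean distance in isometric log-ratio (ILR) coordinates and the Aitchison distance on the simplex can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the Helmert-style basis is only one of many valid orthonormal ILR bases (the abstract does not say which one the paper uses).

```python
import numpy as np


def clr(p):
    """Centered log-ratio transform of a strictly positive composition."""
    logp = np.log(np.asarray(p, dtype=float))
    return logp - logp.mean(axis=-1, keepdims=True)


def helmert_basis(d):
    """Orthonormal (d-1) x d basis of the zero-sum subspace (an assumed choice)."""
    basis = np.zeros((d - 1, d))
    for i in range(1, d):
        basis[i - 1, :i] = 1.0 / i          # first i entries
        basis[i - 1, i] = -1.0              # balancing entry, row sums to zero
        basis[i - 1] *= np.sqrt(i / (i + 1.0))  # normalize the row to unit length
    return basis


def ilr(p):
    """Isometric log-ratio coordinates: CLR vector expressed in the basis."""
    p = np.asarray(p, dtype=float)
    return clr(p) @ helmert_basis(p.shape[-1]).T


def ilr_inverse(z):
    """Map ILR coordinates back to the probability simplex."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z @ helmert_basis(z.shape[-1] + 1))
    return w / w.sum(axis=-1, keepdims=True)


def aitchison_distance(p, q):
    """Aitchison distance, computed as the Euclidean norm of the CLR difference."""
    return np.linalg.norm(clr(p) - clr(q))


if __name__ == "__main__":
    p = np.array([0.7, 0.2, 0.1])  # e.g. hypothetical tagger scores
    q = np.array([0.3, 0.3, 0.4])
    print(aitchison_distance(p, q), np.linalg.norm(ilr(p) - ilr(q)))
```

Because the basis rows are orthonormal and span the subspace in which CLR vectors live, the two printed distances agree, which is why a quadratic cost in ILR coordinates corresponds to the squared Aitchison distance.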

2605.01352 2026-05-05 cs.OS cs.AI cs.DC

VUDA: Breaking CUDA-Vulkan Isolation for Spatial Sharing of Compute and Graphics on the Same GPU

Bin Xu, Pengfei Hu, Wenxin Zheng, Jinyu Gu, Haibo Chen

Abstract

GPU-based simulation environments for embodied AI interleave physics simulation (CUDA) and photorealistic rendering (Vulkan) on a single device. We observe that two foundational scenarios -- simulation data generation and RL training -- can be naturally adapted to execute their simulation and rendering phases concurrently, presenting a significant opportunity to improve GPU utilization through spatial multiplexing. However, a fundamental obstacle we term execution isolation prevents this: CUDA and Vulkan create separate GPU contexts whose channels are bound to different scheduling groups, confining compute and graphics to mutually exclusive time slices. Existing spatial-sharing techniques are limited to the CUDA ecosystem, while temporal-sharing approaches underutilize available resources. This paper presents VUDA, a system that breaks execution isolation to enable spatial parallelism between CUDA compute and Vulkan graphics workloads. VUDA is built on two key observations: although CUDA and Vulkan expose different programming abstractions, their execution paths converge to a common channel primitive at the driver and hardware level; meanwhile, their virtual-address spaces are inherently disjoint, making safe page-table merging feasible without remapping. VUDA exposes a thin API for developers to annotate co-schedulable CUDA streams, and realizes spatial sharing through channel redirection into Vulkan's scheduling domain and page-table grafting to unify address spaces, eliminating all data copying on the critical path. Experiments on representative embodied-AI workloads show that VUDA delivers up to 85% higher throughput than temporal-sharing baselines, while improving GPU utilization and reducing end-to-end latency.

2605.01341 2026-05-05 cs.LO cs.AI

ABox Abduction for Inconsistent Knowledge Bases under Repair Semantics

Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

Abstract

Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and repair. ABox abduction has been well investigated for consistent KBs and classical semantics, but little is known about the case of inconsistent KBs, which can be caused by erroneous data. In this paper we define suitable notions of abduction in this setting and propose criteria that guide abduction towards "useful" hypotheses. To regain meaningful reasoning in the presence of inconsistencies, we use well-established repair semantics. We provide a comprehensive landscape of the complexity of ABox abduction under repair semantics, treating different variants of the abduction problem for the light-weight description logics DL-Lite and $\mathcal{EL}_\bot$.

2605.01335 2026-05-05 stat.ML cs.LG math.ST stat.TH

Mean Testing under Truncation beyond Gaussian

Yuhao Wang, Roberto Imbuzeiro Oliveira, Themis Gouleakis

Abstract

We characterize the fundamental limits of high-dimensional mean testing under arbitrary truncation, where samples are drawn from the conditional distribution $P(\cdot \mid S)$ for an unknown truncation set $S$ that may hide up to an $\varepsilon$-fraction of the probability mass. For distributions with $p$-th directional moments of magnitude at most $\nu_{P,p}$, truncation induces a bias of order $O(\nu_{P,p}\varepsilon^{1-1/p})$. This bias creates a sharp information-theoretic detectability floor: when the signal $\alpha$ falls below this threshold, the null and alternative hypotheses are indistinguishable even with infinite data. Above this floor, we prove that a simple second-order test achieves near-optimal sample complexity $n = O\!\left(\frac{\|\Sigma_P\|}{(\alpha-4\nu_{P,p}\varepsilon^{1-1/p})^2}\sqrt{d}\right)$. We further identify a structural escape from this finite-moment bias barrier. Under a directional median regularity assumption, truncation bias improves to linear order $O(\varepsilon)$. This reveals an intermediate regime in which estimation requires $\Theta(d)$ samples for uniform recovery, while testing recovers the classical $\Theta(\sqrt d)$ rate once truncation bias is eliminated. Together, our results provide a unified framework for mean testing under truncation, connecting finite-moment, sub-Gaussian, and median-regular structural regimes.
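
To get a feel for how the truncation bias eats into the usable signal, the two expressions in the abstract can be evaluated for concrete numbers. This is only an illustrative sketch: the function names are hypothetical, the constants hidden by the $O(\cdot)$ notation are set to 1 (an assumption), and returning infinity when the margin is non-positive simply marks the stated bound as vacuous rather than reproducing any result from the paper.

```python
import math


def truncation_bias(nu, eps, p):
    """Bias scale nu * eps^(1 - 1/p), with the hidden constant assumed to be 1."""
    return nu * eps ** (1.0 - 1.0 / p)


def sample_size_scale(sigma_norm, alpha, nu, eps, p, d):
    """Evaluate ||Sigma|| * sqrt(d) / (alpha - 4 * bias)^2 without the O() constant."""
    margin = alpha - 4.0 * truncation_bias(nu, eps, p)
    if margin <= 0.0:
        # The stated bound gives no finite sample size in this regime.
        return math.inf
    return sigma_norm * math.sqrt(d) / margin ** 2


if __name__ == "__main__":
    # Hypothetical numbers: second moments (p = 2), 1% of the mass hidden.
    print(truncation_bias(nu=1.0, eps=0.01, p=2))
    print(sample_size_scale(sigma_norm=1.0, alpha=1.0, nu=1.0, eps=0.01, p=2, d=100))
```

With $p = 2$ the bias scales as $\sqrt{\varepsilon}$, so hiding even a small fraction of the mass removes a comparatively large part of the signal margin; larger $p$ moves the exponent toward 1.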

2605.01319 2026-05-05 quant-ph cs.LG

Barren Plateaus as Destructive Interference: A Diagnostic Framework and Implications for Structured Ansatzes

Pilsung Kang

Abstract

Barren plateaus (BPs) are usually described by the exponential suppression of gradient variance, but the mechanism by which gradient signal disappears remains unclear. We show that this phenomenon can be understood as destructive interference among termwise gradient contributions. To make this perspective operational, we introduce a diagnostic framework based on the cancellation ratio $R_k$, the effective term count $N_{\mathrm{eff},k}$, and the interference-quality measure $B_{\mathrm{eff},k}=R_k\sqrt{N_{\mathrm{eff},k}}$. Under a random-sign model, $B_{\mathrm{eff},k}$ remains near a stable baseline, defining a random-sign cancellation regime. For the transverse-field Ising model (TFIM), we find that the hardware-efficient ansatz (HEA) remains close to this regime across system sizes and depths, whereas the Hamiltonian variational ansatz (HVA) systematically escapes it. In particular, HVA exhibits larger $B_{\mathrm{eff},k}$ not merely because $N_{\mathrm{eff},k}$ is larger, but because $R_k$ also remains systematically larger despite the broader term participation. This pattern indicates improved sign organization rather than simple term suppression. We further establish an exact identity that connects the proposed interference diagnostics directly to the standard variance-based theory of BPs. These results position destructive interference as a mechanistic interpretation of BP-like behavior in the regimes studied here, but they do not imply that BPs and destructive interference are universally interchangeable across all architectures and settings.
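
The abstract gives the relation $B_{\mathrm{eff},k}=R_k\sqrt{N_{\mathrm{eff},k}}$ but does not define $R_k$ or $N_{\mathrm{eff},k}$. The sketch below therefore rests on assumed definitions, chosen only to illustrate the idea: $R$ is taken as $|\sum_i g_i| / \sum_i |g_i|$ and $N_{\mathrm{eff}}$ as the participation ratio $(\sum_i |g_i|)^2 / \sum_i g_i^2$ over termwise gradient contributions $g_i$. The paper's actual definitions may differ.

```python
import numpy as np


def interference_diagnostics(terms):
    """Return (R, N_eff, B_eff) for termwise contributions, under assumed definitions."""
    g = np.asarray(terms, dtype=float)
    abs_sum = np.abs(g).sum()
    sq_sum = np.square(g).sum()
    if abs_sum == 0.0:
        raise ValueError("all contributions are zero; diagnostics are undefined")
    r = abs(g.sum()) / abs_sum        # 1 = fully aligned signs, 0 = total cancellation
    n_eff = abs_sum ** 2 / sq_sum     # effective number of participating terms
    b_eff = r * np.sqrt(n_eff)        # equals |sum g| / sqrt(sum g^2) here
    return r, n_eff, b_eff


def random_sign_baseline(n_terms, n_trials, seed=0):
    """Average B_eff when equal-magnitude terms carry independent random signs."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_trials, n_terms))
    return float(np.mean([interference_diagnostics(row)[2] for row in signs]))


if __name__ == "__main__":
    print(interference_diagnostics([1.0, 1.0, 1.0, 1.0]))  # aligned signs
    print(interference_diagnostics([1.0, -1.0]))           # exact cancellation
    print(random_sign_baseline(n_terms=1000, n_trials=2000))
```

Under these assumed definitions, random signs keep $B_{\mathrm{eff}}$ of order one regardless of the number of terms, whereas aligned signs make it grow like $\sqrt{N_{\mathrm{eff}}}$, which is one way to read the contrast between a random-sign cancellation regime and improved sign organization.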

2605.01307 2026-05-05 eess.SP cs.AI cs.NI

Spectral- and Energy-efficient Multi-BS Multi-RIS Pinching-antenna Systems: A GNN-based Approach

Changpeng He, Yang Lu, Wei Chen, Bo Ai, Arumugam Nallanathan, Zhiguo Ding

Abstract

This paper investigates coordinated downlink transmission in a multi-base station (multi-BS) multi-reconfigurable intelligent surface (multi-RIS)-assisted pinching-antenna (PA) system, where each user equipment (UE) is associated with a single BS and each BS is equipped with movable PAs deployed on parallel waveguides. We formulate sum rate (SR) and energy efficiency (EE) maximization problems by jointly optimizing PA placement, RIS phase shifts, transmit beamforming, and BS-UE association under constraints of inter-PA spacing, power budget, and unit-modulus phase shift. To address the resulting highly coupled mixed-variable problem, we propose a three-stage graph neural network (GNN) that integrates heterogeneous and homogeneous graph representations and is trained end-to-end in an unsupervised manner. Extensive numerical results demonstrate that the proposed three-stage GNN consistently outperforms representative system and learning baselines, generalizes well to unseen numbers of UEs, RISs, and BSs, and maintains millisecond-level inference time. The results also validate the effectiveness of the proposed design from both system and architectural perspectives. Moreover, PAs are shown to enhance SR and EE, and the performance gain grows as the number of PAs increases.

2605.01306 2026-05-05 physics.optics cs.LG physics.app-ph

Machine Learning Enhanced Laser Spectroscopy for Multi-Species Gas Detection in Complex and Harsh Environments

Mohamed Sy

Comments PhD thesis

Abstract

Laser absorption spectroscopy (LAS) is a well-established technique for non-intrusive measurement of gas species in combustion and atmospheric environments, but conventional methods struggle with multi-species mixtures under dynamic or interference-laden conditions. Overlapping spectral features, noise, and incomplete reference data limit reliability when unknown or weakly absorbing species are present. This dissertation develops diagnostics combining LAS with machine learning (ML) to address these limitations. Deep denoising autoencoders (DDAEs) are applied to shock-tube measurements during high-speed hydrocarbon pyrolysis, improving signal fidelity and detection limits for trace species. A structured unsupervised framework, HT-SIMNet, then mitigates interference from unknown species without full calibration data, using spectral augmentation and a Noise2Noise-inspired scheme to isolate species in reactive systems. Where reference spectra are unavailable, UnblindMix, an autoencoder-based blind source separation method, reconstructs concentrations and spectral signatures directly from mixture data, validated on mixtures of up to eight components. To recover weakly absorbing species masked by broader absorbers, a feature-engineering method based on first derivatives and convolutions selectively highlights minor species. Finally, VOC-certifire combines randomized smoothing with Voigt-based spectral perturbation to provide certifiable classification of volatile organic compounds under varying conditions. All techniques are experimentally validated and benchmarked. The integration of spectroscopic hardware with ML offers a path toward real-time, interference-resilient, reference-free gas detection for combustion science, environmental monitoring, and industrial safety.

2605.01298 2026-05-05 cs.CR cs.CV

Checkerboard: A Simple, Effective, Efficient and Learning-free Clean Label Backdoor Attack with Low Poisoning Budget

Yi Yang, Jinyang Huang, Binbin Liu, Feng-Qi Cui, Xiaokang Zhou, Zhi Liu, Jie Zhang, Meng Li

Abstract

Backdoor attacks threaten the deep learning supply chain by poisoning a small fraction of the training data so that a model behaves normally on clean inputs but misclassifies trigger-carrying inputs to an attacker-chosen target class. Clean-label backdoor attacks are especially dangerous because poisoned samples remain label-consistent and are therefore harder to detect. Yet existing clean-label attacks typically rely on expensive optimization, surrogate-model training, or nontrivial data access. We present Checkerboard, a theoretically grounded, learning-free clean-label backdoor attack that is effective, efficient, and simple to implement. From a linear separability formulation, we derive a checkerboard trigger in closed form, removing the need for surrogate-model training and trigger optimization. For texture-rich datasets, we introduce Complexity-driven Sample Selection, which uses only target-class data to improve trigger-to-background contrast by selecting low-complexity images for poisoning. Across four benchmark datasets, Checkerboard outperforms 8 baseline attacks and achieves state-of-the-art performance under low poisoning budgets. For example, on CIFAR-10, under a trigger perturbation budget of $10/255$, poisoning 20 training samples achieves $99.99\%$ Attack Success Rate (ASR). On ImageNet-100, a poisoning rate of only $0.46\%$ yields over $94\%$ ASR without degrading clean accuracy. The proposed attack also remains effective against state-of-the-art backdoor defenses and shows strong resistance to adaptive defenses.

2605.01282 2026-05-05 eess.IV cs.AI

A Target-Free Harmonization Method for MRI

Minjun Kim, Dong Ju Mun, Hwihun Jeong, Hangyeol Park, Haechang Lee, Se Young Chun, Jongho Lee

Comments 37 pages, 10 figures

Abstract

In MRI, variations in scan parameters, sequence, or hardware can lead to discrepancies in image appearance, even for the same subject. These inconsistencies, known as domain shifts, can hinder image analysis and degrade the performance of deep learning models trained on data from specific target domains. MRI image harmonization aims to address these issues by aligning source domain images to the target domain images while preserving biological information such as anatomical structures. However, most existing harmonization approaches require access to both source and target domain data at training or test time. This dependence requires data sharing between institutions, raising concerns about patient privacy and substantially limiting the harmonization approaches that can be practically deployed in clinical settings. To overcome these limitations, we introduce TgtFreeHarmony, a harmonization framework tailored for target-free scenarios that eliminates the need for target domain data and any data sharing, enabling privacy-preserving harmonization directly within the source institution. Our approach estimates the target domain style by searching the manifold of MRI domain styles constructed via a disentanglement-based generator using Bayesian optimization guided by the performance of a downstream task model, which is trained on target domain data. We evaluated our method on the brain tissue segmentation task across multiple institutes and demonstrated that it effectively harmonizes source images into target images, leading to improved downstream task performance. By enabling harmonization without any access to target-domain data, TgtFreeHarmony establishes a new direction for privacy-preserving harmonization that can be realistically deployed within clinical environments.

2605.01280 2026-05-05 cs.DC cs.AI

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

Zijie Zhou

Abstract

This position paper argues that LLM inference serving has outgrown generic heuristics and now demands mathematical optimization and algorithmic foundations. Despite rapid advances in serving systems such as vLLM and SGLang, their algorithmic cores remain largely unchanged from classical distributed computing: request routing uses join-shortest-queue or round-robin, scheduling defaults to FIFO, and KV cache eviction follows LRU. These general-purpose policies ignore the distinctive structure of LLM inference: dynamically growing KV cache memory, prefill-decode phase asymmetry, unknown output lengths, and continuous batching constraints. We contend that the field must develop mathematical models capturing these characteristics, enabling the design of algorithms with provable performance guarantees across diverse workloads, rather than heuristics that may succeed in some scenarios but fail unpredictably in others. Emerging work at the intersection of operations research and ML systems demonstrates that principled methods can match or exceed heuristic performance while providing theoretical guarantees. We call on the community to recognize algorithmic design for LLM serving as a research frontier.
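
For readers unfamiliar with the classical policies named in this abstract, here is a minimal generic sketch of join-shortest-queue routing and LRU eviction. It is not taken from vLLM or SGLang; the names `join_shortest_queue` and `LRUCache` are hypothetical, and the cache counts entries rather than modeling KV cache memory that grows during decoding, which is exactly the kind of structure the paper argues such policies overlook.

```python
from collections import OrderedDict


def join_shortest_queue(queue_lengths):
    """Pick the replica with the fewest queued requests (ties go to the lowest index)."""
    return min(range(len(queue_lengths)), key=queue_lengths.__getitem__)


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first, most recent last

    def get(self, key):
        """Return the cached value, or None on a miss; a hit refreshes recency."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def put(self, key, value):
        """Insert or update an entry; return the evicted key, or None if nothing was evicted."""
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            evicted_key, _ = self.entries.popitem(last=False)
            return evicted_key
        return None


if __name__ == "__main__":
    print(join_shortest_queue([3, 1, 2]))  # index of the least loaded replica
    cache = LRUCache(capacity=2)
    cache.put("req-a", "kv-a")
    cache.put("req-b", "kv-b")
    cache.get("req-a")                     # req-a becomes most recently used
    print(cache.put("req-c", "kv-c"))      # evicts req-b
```

Both policies decide from a single scalar (queue length or recency) and know nothing about output length or prefill versus decode cost.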