arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.01264 2026-05-05 cs.SE cs.LG

FeedbackLLM: Metadata driven Multi-Agentic Language Agnostic Test Case Generator with Evolving prompt and Coverage Feedback

Kushal Jasti, Tejamani Prashanth Sahu, Rishitha Pentyala, Muvvala Mohit, Vivek Yelleti

Abstract

Traditional approaches to test case generation often involve manual effort and incur significant computational overhead. Additionally, these approaches are not scalable and hence are unsuitable for complex software systems. Recently, Large Language Models (LLMs) have been applied to software testing. However, single-shot prompt engineering-based approaches tend to hallucinate and generate redundant test cases, resulting in fewer covered branches. To address these limitations, we propose FeedbackLLM, a novel automated language-agnostic test case generation framework based on a tightly coupled two-stage approach. In the first stage, FeedbackLLM extracts the input constraints by parsing the source code and generates candidate test cases. In the second stage, the quality of the test cases is evaluated by two specialized LLM feedback agents: (i) the Line Feedback Agent, which extracts metadata about missed line executions, and (ii) the Branch Feedback Agent, which extracts metadata about unexecuted branch conditions. The agents communicate in tandem, and the two-stage procedure is repeated for k steps. We also introduce a redundancy prevention cache to avoid duplicate API requests and unnecessary execution cycles. The proposed architecture is evaluated on standard benchmark programs written in C and Python. FeedbackLLM achieved higher line and branch coverage than baseline tools while scaling linearly in execution time.
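
The redundancy prevention cache mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the cache is keyed, so the SHA-256 prompt hash, the `CachedClient` class, and the `call_llm` callable are all assumptions made for illustration, not FeedbackLLM's actual design.

```python
import hashlib
from typing import Callable, Dict


class CachedClient:
    """Hypothetical wrapper that skips repeated LLM requests for identical prompts."""

    def __init__(self, call_llm: Callable[[str], str]) -> None:
        self._call_llm = call_llm          # assumed: any function mapping prompt -> response
        self._cache: Dict[str, str] = {}   # prompt hash -> cached response
        self.api_calls = 0                 # number of requests actually sent

    def generate(self, prompt: str) -> str:
        # Key the cache on a hash of the exact prompt text (an assumption).
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call_llm(prompt)
        return self._cache[key]
```

In a feedback loop that runs for k steps, a wrapper of this kind would only pay for a request when the coverage feedback actually changes the prompt.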

2605.01263 2026-05-05 cs.DS cs.LG

New Bounds for Kernel Sums via Fast Spherical Embeddings

Tal Wagner

Comments ICML 2026

Abstract

We study query time bounds for the fundamental problem of estimating the kernel mean $\frac1{|X|}\sum_{x\in X}\mathbf{k}(x,y)$ of a query $y$ in a finite dataset $X\subset\mathbb{R}^d$ up to a prescribed additive error $\varepsilon$. The best known bounds for the Gaussian kernel are $O(d/\varepsilon^2)$, $\widetilde O(d+1/\varepsilon^4)$, and $\widetilde O(d+\Delta^2/\varepsilon^2)$, where $\Delta$ is the diameter of a region containing the points. We prove the new bound $\widetilde O(d+\varepsilon\Delta^2+1/\varepsilon^3)$, which improves over the previous ones in regimes with small error $\varepsilon$ and intermediate diameter $\Delta$. At the center of our proof is a new fast spherical embedding theorem in the sense introduced by Bartal, Recht and Schulman (2011), which limits the embedded data diameter while preserving local Euclidean distances and avoiding "distance collapse" at larger scales. This fast embedding theorem may be of independent interest.
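
For orientation, a bound of the form $O(d/\varepsilon^2)$ is what plain uniform sampling gives: average the kernel over roughly $1/\varepsilon^2$ random data points, each costing $O(d)$ to evaluate. The sketch below shows that baseline only, not the paper's embedding-based method. The unit bandwidth and the function names are assumptions.

```python
import math
import random
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """k(x, y) = exp(-||x - y||^2); the unit bandwidth is an assumption."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))


def exact_kernel_mean(X: List[Sequence[float]], y: Sequence[float]) -> float:
    """Exact kernel mean, costing O(|X| * d) per query."""
    return sum(gaussian_kernel(x, y) for x in X) / len(X)


def sampled_kernel_mean(X: List[Sequence[float]], y: Sequence[float],
                        eps: float, rng: random.Random) -> float:
    """Uniform-sampling estimate using about 1/eps^2 kernel evaluations."""
    m = math.ceil(1.0 / eps ** 2)
    sample = [X[rng.randrange(len(X))] for _ in range(m)]
    return sum(gaussian_kernel(x, y) for x in sample) / m
```

Because kernel values lie in [0, 1], the standard deviation of this estimate is at most $\varepsilon/2$, so the additive error is of order $\varepsilon$ with constant probability.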

2605.01251 2026-05-05 cs.HC cs.RO

What Does a Meow Mean? In Search of Intuitively Understandable Communication by a Nonverbal Companion Robot

Vivienne Bihe Chi, Claudia B. Rébola, Bertram F. Malle

Comments To appear in the Proceedings of the 18th International Conference on Social Robotics (ICSR 2026)

Abstract

Older adults living alone have a number of challenges, and robots can help with some of them--by providing reminders, initiating activity, or offering comfort. As part of developing a cat robot with limited assistive functions, we designed a set of nonverbal communication signals, both auditory (cat sounds) and visual (icons on a small display). To evaluate these signals we used a mixed-methods, user-centered approach. After a pilot study, a focus group with older adults suggested revisions to the initial signal set. A large-sample online experiment then tested whether adults over the age of 65 could accurately infer the robot's communicative intentions. When both visual and auditory signals were present, accuracy was high. When visual signals were absent, accuracy often decreased; when auditory signals were absent, accuracy sometimes increased. So the auditory signals were less helpful, except when the robot conveyed strong sentiments (e.g., purring while being petted).

2605.01245 2026-05-05 cs.HC cs.AI

The Garden of Forking Paths: Narrative Arc-Conditioned Gameplay Planning

Yunge Wen, Chenliang Huang, Hangyu Zhou, Zhuo Zeng, Chun Ming Louis Po, Julian Togelius, Timothy Merino, Sam Earle

Abstract

Narrative archetypes (e.g., Hero's Journey, Three-act structure) provide universal story structures that resonate across cultures and media and are important for video game storytelling, yet existing LLM-based methods lack explicit use of these archetypes in procedurally generated games. We propose Forking Garden, a framework for narrative arc-conditioned gameplay planning that generates branching games from user-provided storylines. Our approach first generates a diverse pool of independent nodes, then assembles them into a dungeon graph via arc-guided constraint algorithms, where each node achieves multimodal alignment of gameplay elements. We develop an end-to-end interactive system that instantiates the framework.

2605.01238 2026-05-05 cs.HC cs.CV

EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning

Zikang Leng, Edan Eyal, Yingtian Shi, Jiaman He, Yaqi Liu, Thomas Plötz

Abstract

Engagement, which links to attentional, emotional, and cognitive dimensions, plays an important role in learning. In online and video-based learning environments, learners often need to regulate their own interactions with instructional materials. Measuring and reflecting on engagement can therefore support both learners and adaptive learning systems. In this study, we use wearable and camera-based sensing devices to collect physiological and motion signals, including PPG, ECG, EDA, EEG, IMU, heart rate, temperature, and eye-tracking data, to estimate learner engagement. We conducted a user study with 16 participants in a video-based learning scenario, where participants completed learning tasks and provided repeated in-situ self-reports of engagement through brief probes. We develop and evaluate a system for engagement estimation, compare different sensing modalities, and further analyze the feasibility and effectiveness of multimodal modeling for characterizing learner engagement. Across participant-based cross-validation, our model achieves an MAE of 0.81, 83.75% within-1 accuracy, 73.93% binary accuracy, and 68.45% binary Macro-F1, outperforming sensor-free, statistical, deep temporal, foundation-model, and LLM-based baselines. Our results suggest that fine-grained engagement estimation is feasible but inherently noisy, and that practical systems should prioritize lightweight combinations of behavioral and physiological signals over full multimodal instrumentation. We release the EduGage dataset, including synchronized multimodal sensor signals, probe-aligned momentary engagement labels, video metadata, quizzes, and study materials, to support reproducible research on fine-grained sensor-based engagement modeling in self-guided learning.
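
Two of the reported metrics are easy to state in code. The sketch below reads "within-1 accuracy" as the fraction of predictions at most one scale point away from the self-reported label, which is an assumption since the abstract does not define it. The function names and the example labels are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of predictions at most 1 scale point away from the label (assumed definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)
```

For made-up labels `[1, 3, 5, 2]` and predictions `[2, 3, 2, 4]`, the absolute errors are 1, 0, 3, and 2, so the MAE is 1.5 and the within-1 accuracy is 0.5.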

2605.01219 2026-05-05 cs.MM cs.CV cs.SD eess.IV

Multimodal Confidence Modeling in Audio-Visual Quality Assessment

Mayesha Maliha R. Mithila, Mylene C. Q. Farias

Comments Accepted at ICIP 2026, 6 pages, 4 figures, no supplementary material

Abstract

Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.

2605.01170 2026-05-05 physics.app-ph cs.RO

A skin-like conformal sensor for real-time shape mapping

Kaiping Yin, Sooik Im, Chaorui Qiu, Yun Bai, Xiangyu Lu, Chenhang Li, Junjie Yao, Xiaoyue Ni

Comments 13 pages, 5 figures

Abstract

Reliable real-time 3D shape sensing is essential for robust control and interpretation of deformable systems during motion. Existing vision-based approaches require line-of-sight and complex instrumentation, limiting operation in occluded and space-constrained settings. Here, we introduce a scalable, skin-like sensor that reconstructs its continuous 3D deformation in real time from distributed strain measurements. The device embeds a 2D array of mirror-stacked, printed oxidized eutectic gallium-indium (o-EGaIn) strain gauges within an elastomeric film to measure off-neutral-axis strains. Combined with a mechanics-informed observation model and a fast optimization routine, the system estimates local curvature, elongation, offset, and orientation under concurrent stretching, bending, and indentation, enabling reconstruction of complex surfaces. A 5-by-5 array with a 12 mm pitch achieves a mean surface reconstruction error of 0.62 mm with 0.1 s latency across all tested scenarios. When conforming to complex surfaces, the sensor provides fast 3D shape mapping of the underlying geometry. Demonstrations involving palm gesturing, finger indentation, and contact-induced balloon deformation highlight utility for epidermal motion tracking, haptic interaction, and intraoperative monitoring.

2605.01163 2026-05-05 cs.IR cs.LG

Multimodal Data Curation Through Ranked Retrieval

Pratyush Muthukumar, Harshil Kotamreddy, Sarah Amiraslani, Tomo Kanazawa, Ramani Akkati, Shaan Jain, Andrew Mathau

Comments ICLR DATA-FM 2026

Abstract

Shared embedding spaces are widely used for multimodal search and data curation. In practice, two problems often limit how well this works. First, embeddings can reflect modality more than meaning, so examples cluster by input type even when the underlying content matches. Second, the paired supervision used to train these spaces is often noisy. When we blend many heterogeneous, human-labeled datasets, these issues reinforce each other and degrade cross-modal retrieval. We present a framework that improves alignment by acting on both the training pairs and the embedding model. Symmetric Nucleus Subsampling (SNS) refines training pairs by trimming raw inputs and annotations to the portions that best support each other. Expert Embedding Engine (EEE) combines complementary embedding experts using a learned projection network, together with a bias-aware objective that reduces modality-driven separation in the embedding space. We demonstrate that this approach collapses the modality gap by over 90% on average vs base embedding experts and is a strong data curator, with datablends from our method outperforming stratified sampling and traditional curation baselines in downstream model performance.

2605.01160 2026-05-05 cs.SE cs.AI

The Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development

Sabry E. Farrag

Comments 30 pages, 4 tables, 1 figure, 71 references

Abstract

Since 2022, AI-powered coding assistants have produced contradictory evidence: controlled studies report 20-56% productivity gains on well-scoped tasks, while the most rigorous RCT documents a 19% slowdown for experienced developers, and telemetry across 10,000+ developers shows 98% more pull requests but 91% longer review times with flat delivery metrics. This paper argues these findings constitute the Productivity-Reliability Paradox (PRP): a systematic phenomenon emerging from non-deterministic code generators and insufficient specification discipline. Through a multivocal literature review of 67 sources (2022-2026), this paper: (1) formally defines the PRP with three moderating variables (task abstraction, codebase maturity, developer experience) and two amplifying mechanisms (code review bottleneck, context window constraint); (2) proposes the AI-Augmented Methodology Taxonomy (AAMT), classifying six methodologies under three AI integration tiers; (3) introduces the Specification Governance Model (SGM), grounded in Transaction Cost Economics, with a practical governance decision guide; and (4) evaluates Spec Kit and TDAD as SGM instantiations via a four-month pilot study. Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability.

2605.01133 2026-05-05 cs.CR cs.LG cs.MA

When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Lingxi Zhang, Guangtao Zheng, Hanjie Chen

Abstract

Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Attackers can circumvent such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our analysis further reveals a fundamental limitation of embedding-based defenses: because they rely solely on the text embeddings, they ignore token-level confidence signals such as logits, which can remain informative when embeddings are not distinguishable under attack. We propose using confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.
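
The idea of pruning or down-weighting messages by confidence can be sketched as follows. The abstract does not specify how the confidence score is computed, so the geometric-mean token probability, the threshold rule, and both function names are assumptions chosen for illustration.

```python
import math
from typing import Dict, List


def message_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability of a message, in (0, 1] (assumed score)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def weight_messages(messages: Dict[str, List[float]], threshold: float) -> Dict[str, float]:
    """Give weight 0 to messages below the threshold; weight the rest by confidence."""
    weights: Dict[str, float] = {}
    for agent, logprobs in messages.items():
        conf = message_confidence(logprobs)
        weights[agent] = conf if conf >= threshold else 0.0
    return weights
```

A receiving agent could then aggregate incoming messages with these weights instead of treating every sender as equally reliable.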

2605.01104 2026-05-05 cs.SE cs.CL cs.HC

RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions

Keyu He, Qianou Ma, Valerie Chen, Wayne Chi, Tongshuang Wu

Abstract

Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer's workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform's linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.

2605.01091 2026-05-05 cs.CY cs.AI cs.MA

Governing What the EU AI Act Excludes: Accountability for Autonomous AI Agents in Smart City Critical Infrastructure

Talal Ashraf Butt, Muhammad Iqbal, Razi Iqbal

Comments 24 pages, 3 figures, 8 tables. Submitted to Computer Law & Security Review

Abstract

When a traffic signal controller adjusts green phases and a grid manager curtails power on the same corridor, each system may comply with its own obligations. The resident who suffers the combined effect has no single authority to hold accountable and, under the EU AI Act, limited means to obtain an explanation. Annex III, point 2 excludes safety-component AI in critical infrastructure from Article 86 explanation rights and Article 27 fundamental-rights impact assessment. Provider and deployer duties under Articles 9-15 still apply, and residual pathways under the GDPR, NIS2, and tortious liability offer partial coverage. The Act's principal resident-facing accountability instruments are nonetheless narrowed for the autonomous infrastructure systems most likely to interact across agencies. The paper traces this accountability deficit through four residual pathways (GDPR Article 22, GDPR transparency obligations, tortious liability, and NIS2) and shows that each is structurally bounded by individual-controller, individual-decision scope. As a governance response, it presents AgentGov-SC, a three-layer architecture (Agent, Orchestration, City) specifying 25 governance measures with bidirectional traceability to the EU AI Act, ISO/IEC 42001, and the NIST AI Risk Management Framework. Five conflict resolution rules and an autonomy-calibrated activation model complete the design. A scenario analysis traces governance activation through a multi-agent corridor cascade involving three documented UAE smart-city systems, with a contrasting single-system scenario confirming proportional activation. The paper contributes a regulatory gap analysis and governance architecture for an increasingly important class of urban AI deployment that existing frameworks treat as bounded and isolated.

2605.01078 2026-05-05 cs.CR cs.AI

A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

Soumil Datta, Melissa Umble, Daniel S. Brown, Guanhong Tao

Abstract

Retrieval-augmented generation and tool-integrated LLM agents increasingly depend on external textual sources. This reliance broadens the available attack surface, allowing adversaries to insert malicious instructions that trigger unintended model behaviors. Current defensive measures often utilize LLM-based detectors to filter such content, but these approaches remain vulnerable to optimization-based attacks. Additionally, training-based methods frequently fail to generalize to novel data distributions. To resolve these issues, we introduce SONAR, a prompt sanitization framework that identifies and removes injected content using metrics from natural language inference. Specifically, SONAR constructs a sentence-level relational graph across the user query and external data. By using entailment and contradiction scores as edge weights, the system identifies sentences that deviate from the core task. It then employs connectivity-driven pruning to eliminate flagged injection seeds and their related neighbors while maintaining benign context. Rigorous evaluations across several models and datasets show that SONAR reduces the attack success rate to nearly zero, significantly outperforming nine established baseline defenses.

2605.01074 2026-05-05 cs.NE cs.LG

Benchmarking local Hebbian learning rules for memory storage and prototype extraction

Anders Lansner, Andreas Knoblauch, Naresh B Ravichandran, Pawel Herman

Comments 31 pages, 9 + 2 suppl figures, 5 tables

Abstract

Associative memory, or content-addressable memory, is an important component function in computer science and information processing, and at the same time a key concept in cognitive and computational brain science. Many different neural network architectures and learning rules have been proposed to model the brain's associative memory while investigating key component functions like figure-ground segmentation, perceptual reconstruction and rivalry. A less investigated but equally important capability of associative memory is prototype extraction, where the training set comprises distorted prototype instances and the task is to recall the correct generating prototype given a new distorted instance. In this paper we benchmark the associative memory function of seven different Hebbian learning rules employed in non-modular and modular recurrent networks with winner-take-all dynamics operating on moderately sparse binary patterns. We measure pattern storage and weight information capacity, prototype extraction capabilities, and sensitivity to correlations in data. The original additive Hebb rule has the worst capacity, covariance learning proves to be robust but with moderate capacity, and the Bayesian-Hebbian learning rules show the highest capacity in almost all conditions tested.

2605.01072 2026-05-05 hep-th cs.LG

Reconstructing conformal field theoretical compositions with Transformers

Haotian Cao, Garrett Merz, Kyle Cranmer, Gary Shiu

Abstract

We study the use of transformers to reconstruct the compositions of tensor products of two-dimensional rational conformal field theories (RCFTs) based on their low-energy spectra. The task is challenging due to its combinatorial nature. The constituent theories are characterized by their central charges and affine Lie algebra labels. We achieve 98% accuracy in recovering the constituents of tensor product theories constructed from Wess-Zumino-Witten models. We further demonstrate that our method generalizes to CFTs with larger central charge and unseen classes of RCFTs by adding a small number of out-of-domain examples. Our results show that transformers are effective at this task and point towards a new tool for bulk reconstruction in AdS/CFT.

2605.01060 2026-05-05 cs.DC cs.LG

SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data

Shashank Kapadia, Deep Narayan Mishra, Sujal Reddy Alugubelli, Ajay Kumar, Swapnil Yadav, Rishi Bhatia

Comments 15 pages, 10 figures, 11 tables

Abstract

We present SURGE, a streaming GPU encoding system deployed in production to generate embeddings for over 800 million texts across 40,000 logical partitions. Production embedding pipelines face a tension between logical data partitioning and efficient GPU utilization: processing each partition independently incurs $P$ inter-process communication (IPC) calls whose overhead limits throughput for compute-light models. Our contributions are analytical: (i) a cost model (Theorem 1) predicting throughput within 2% across three encoders spanning a 15$\times$ parameter range; (ii) a memory-safety bound (Lemma 3) enabling a streaming two-threshold policy with peak memory $O(B_{\min} + n_{\max})$ rather than $O(N)$; and (iii) a $\phi$/CV decision framework characterizing when the pattern applies beyond our workload. The naive fix of batching at fixed size requires $O(N)$ peak memory (32.7 GB at 10M texts; infeasible beyond ~60M on 192 GB nodes), produces no output until all encoding completes, and offers no fault tolerance. SURGE achieves the same throughput with $O(B_{\min} + n_{\max})$ bounded memory (2.6 GB), 68$\times$ faster time-to-first-output (TTFO), and crash recovery at SuperBatch granularity. On 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s -- matching fixed-batch throughput while using 12.6$\times$ less memory. We validate on bge-base (109M, $d$=768, error 1.3%) and across log-normal $\sigma$ in {1.0, 1.72, 2.5} (speedup invariant within $\pm$3%), and compare against a partition-batched baseline (PB-PBP-LB), against which SURGE retains a 7% throughput edge and 2.5$\times$ faster TTFO. Complementary engineering -- zero-copy Arrow serialization (22-25$\times$ speedup) and async I/O pipelining (up to 93% benefit) -- realizes the design but is not the contribution.
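
A memory bound of the form $O(B_{\min} + n_{\max})$ arises naturally when whole partitions are streamed into a buffer that is flushed once it holds at least $B_{\min}$ texts. The sketch below is a single-threshold simplification under that assumption. The abstract does not describe the second threshold of the actual policy, and the `superbatches` function and `Partition` type are hypothetical names.

```python
from typing import Iterable, Iterator, List, Tuple

Partition = Tuple[str, List[str]]  # (partition id, texts); an assumed layout


def superbatches(partitions: Iterable[Partition], b_min: int) -> Iterator[List[Partition]]:
    """Group whole partitions into batches, flushing once b_min texts are buffered."""
    buffer: List[Partition] = []
    buffered_texts = 0
    for part in partitions:
        buffer.append(part)
        buffered_texts += len(part[1])
        if buffered_texts >= b_min:
            yield buffer
            buffer, buffered_texts = [], 0
    if buffer:  # flush the remainder at end of stream
        yield buffer
```

Before each append the buffer holds fewer than `b_min` texts, so it never holds more than `b_min - 1 + n_max` texts, where `n_max` is the largest partition size. Output also starts as soon as the first batch fills, rather than after all input has been read.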

2605.01055 2026-05-05 cs.DC cs.AI

SCION: Size-aware Policy Orchestration for Nonstationary Object Caches (Long Paper Version)

Qizhi Wang

Comments 17 pages, 4 figures, 26 tables. Code repository: https://github.com/Icemap/SCION

Abstract

Object caches underpin cloud and edge services, but production workloads are heterogeneous, nonstationary, and throughput-constrained. Recent simple non-ML policies such as SIEVE and S3-FIFO set a strong baseline, so any learned method must be overhead-aware, robust under drift, and competitive with strong experts. We present SCION, a lightweight policy-orchestration framework that selects among a small set of deployable cache policies using a tiny workload fingerprint computed off the critical path. Our prototype, AUTO, uses short-prefix statistics of object size, cacheability, reuse, and cache size, then applies an offline-trained linear selector to choose among GDSF, S3-FIFO, SIEVE, LHD, W-TinyLFU-AV, and DynamicAdaptiveClimb; a simpler SCION-P90 variant uses only a p90 threshold. In a CPU-only, trace-driven evaluation on 30 public object-cache traces and a separate HR-Cache simulator subset, AUTO improves cacheable-only object miss ratio over SIEVE on a majority of workloads, stays close to the best single expert on average, enables explicit OMR/BMR tradeoff selection, and remains competitive on byte miss ratio. Under a fast-policy budget, AUTO-fast achieves lower cost than the best fixed fast policy. SCION reduces regime-mismatch risk while keeping the hot path unchanged.

2605.01047 2026-05-05 cs.CR cs.AI cs.CL cs.LG

LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

Joseph Spracklen, Pedram Aghazadeh, Farinaz Koushanfar, Murtuza Jadliwala

Abstract

Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.

2605.01040 2026-05-05 cs.CE cs.LG

Differentiable Multiphysics Co-Optimization via Implicit Neural Representations: A Transient Hamburger-Cooking Benchmark

Navid Zobeiry

Comments Preprint. 24 pages, 5 figures

Abstract

The co-optimization of geometry and physical parameters remains challenging in transient multiphysics systems involving moving boundaries, nonlinear material response, phase transitions, and competing objectives. Existing methods often optimize geometry and physical variables separately, rely on simplified steady-state physics, or require offline data generation and reduced design spaces. Here, we present an end-to-end differentiable co-optimization framework that couples an implicit neural representation of geometry with a JAX-compiled Eulerian multiphysics solver. Geometry is represented as a signed distance field using Fourier-feature-encoded spatial coordinates, while boundary conditions, initial conditions, process controls, and material parameters are optimized within the same differentiable loop. Continuous relaxations represent non-smooth physical transitions while preserving compatibility with reverse-mode automatic differentiation and backpropagation through time. We demonstrate the framework using a transient hamburger-cooking benchmark, selected as an interpretable multiphysics problem rather than a culinary optimization exercise. The benchmark combines conductive and convective heat transfer, latent energy effects, moisture and fat transport, shrinkage-induced geometry evolution, evolving contact boundary conditions, flipping-induced boundary-condition changes, and competing quality objectives. Results show that geometry-only optimization modifies shape to relieve thermal bottlenecks, while joint co-optimization distributes the design response across geometry, material state, process variables, and boundary conditions through gradients propagated over the full transient rollout.

2605.01003 2026-05-05 stat.ME cs.LG eess.SP

Pi-Change: A Prior-Informed Multiple Change Point Detection Algorithm

Jonathon Jacobs, Shanshan Chen

Abstract

Statistical change point (CP) detection methods typically rely on likelihood-based inference and ignore contextual information about plausible CP locations beyond the observed sequence. Although informative priors provide a natural way to incorporate such information, general and computationally efficient methods for doing so are lacking, especially for multiple CP detection. To address this gap, we propose a prior-informed CP detection algorithm (Pi-Change) that incorporates prior information on CP locations through a time-varying penalty term. We prove that the proposed penalty can be embedded in the Pruned Exact Linear Time framework while preserving the dynamic programming recursion and pruning rule required for efficient multiple CP detection. Across simulation studies and three time-series applications, Pi-Change discourages spurious CPs unsupported by prior information, remains robust to prior misspecification, and improves detection accuracy. More broadly, Pi-Change extends multiple CP detection beyond purely data-driven fitting by incorporating partial prior knowledge in a computationally efficient and interpretable way. It is particularly useful when CPs arise from heterogeneous mechanisms or are associated with known external events, helping quantify the delay between an event and the resulting structural change.
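
The effect of a time-varying penalty can be seen in a small dynamic program. The sketch below uses the plain optimal-partitioning recursion, without the PELT pruning step, so it runs in quadratic time. The squared-error cost for mean shifts, the `penalty` callable, and the function names are assumptions, not the Pi-Change implementation.

```python
from typing import Callable, List, Sequence


def segment_cost(prefix: List[float], prefix_sq: List[float], s: int, t: int) -> float:
    """Sum of squared deviations from the mean of y[s:t]."""
    total = prefix[t] - prefix[s]
    return (prefix_sq[t] - prefix_sq[s]) - total * total / (t - s)


def detect_change_points(y: Sequence[float], penalty: Callable[[int], float]) -> List[int]:
    """Minimize segment costs plus penalty(s) for each change point at index s."""
    n = len(y)
    prefix, prefix_sq = [0.0], [0.0]
    for v in y:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    best = [0.0] + [float("inf")] * n   # best[t]: optimal cost of y[:t]
    last = [0] * (n + 1)                # last[t]: start of the final segment of y[:t]
    for t in range(1, n + 1):
        for s in range(t):
            pen = penalty(s) if s > 0 else 0.0  # s == 0 means no change point
            cand = best[s] + segment_cost(prefix, prefix_sq, s, t) + pen
            if cand < best[t]:
                best[t], last[t] = cand, s
    cps, t = [], n
    while last[t] > 0:                  # walk back through segment starts
        t = last[t]
        cps.append(t)
    return sorted(cps)
```

With `y = [0]*5 + [10]*5`, a penalty that is low at index 5 (where a prior expects a change) and high elsewhere recovers the change point at 5, while a uniformly high penalty returns no change points.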

2605.00974 2026-05-05 cs.CR cs.CL

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu, Leyao Wang, Menglin Yang, Rex Ying

Abstract

LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing body of work has explored automated jailbreak strategies, existing methods face several fundamental challenges, including the lack of systematic utilization of both successful and failed attack experiences, as well as the absence of principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, existing methods struggle to accumulate transferable knowledge over time and to reliably adapt attack strategies across different targets and evolving safety mechanisms. To address these issues, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies through interaction and feedback, without updating model parameters. Specifically, SRTJ couples experience-driven attack generation with answer set programming (ASP)-based rule selection and constraint-aware composition, where iterative verifier feedback is leveraged to jointly refine successful strategies and analyze failure patterns. The resulting rule memory evolves in a hierarchical multi-level manner, explicitly organizing distilled attack knowledge into long-term, middle-term, and short-term rules, thereby capturing both stable transferable strategies and transient adaptive behaviors to effectively balance exploration and exploitation across attack attempts. Extensive experiments on a mainstream jailbreak benchmark (HarmBench) demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, while exhibiting improved robustness and generalization compared to existing jailbreak methods. The code is available at https://github.com/TheSolkatt/SRTJ.

2605.00972 2026-05-05 physics.data-an cs.AI cs.CV cs.IR

Toward a Scientific Discovery Engine for Weather and Climate Data: A Visual Analytics Workbench for Embedding-Based Exploration

Nihanth W. Cherukuru, Matt Rehme, Kirsten J. Mayer, David John Gagne, John Schreck, John Clyne, Charlie Becker

Comments 5 pages, 3 figures, Preprint

Details
English abstract

Earth system science is producing increasingly large, high-dimensional datasets, from physics-based Earth system models to AI-based weather and climate models. Embedding-based representations can make these data searchable through similarity search and analog retrieval, but nearest neighbors in latent space are not automatically scientifically meaningful: they may reflect real weather structure, or preprocessing, geography, or model bias. Researchers therefore need ways to inspect how embeddings organize meteorological data, compare representation models, develop retrieval strategies, and verify results against physical evidence. We present an open-source visual analytics workbench for each of these steps. The system links embedding experiments to source data, metadata, spatial context, and model configurations, so latent-space results can be traced back to the physics. Users can explore latent spaces for different models, issue global or localized queries, and inspect analogs through familiar meteorological views. This enables a discovery workflow in which scientists characterize a phenomenon of interest in a well-understood dataset, identify its signature in latent space, and then use that signature to probe larger, less-labeled archives or ensembles for similar events. We demonstrate the workbench through tropical-cyclone retrieval using ERA5-derived embeddings and IBTrACS metadata, and evaluate its out-of-core retrieval backend to show that large embedding collections can be searched beyond in-memory limits on commodity workstation hardware.

2605.00971 2026-05-05 eess.IV cs.CV

Reconstruction Interval Z-Phase Dependence of AI Detection Sensitivity in CT Lung Nodule Screening

Dan Soliman

Details
English abstract

Background: Sensitivity of AI-assisted lung nodule detection systems is known to vary with CT acquisition parameters including radiation dose, reconstruction kernel, and slice thickness. However, the dependence of detection probability on nodule position within the reconstruction cycle -- the z-phase -- has not, to the author's knowledge, been characterized for deep learning-based detection systems. Methods: A retrospective analysis was performed using the LIDC-IDRI dataset. Detection results from a previously validated 154-case perturbation study were re-analyzed. For each consensus nodule (>=4-reader agreement), z-phase was defined as the fractional position of the nodule center within the reconstruction cycle, folded to [0, 0.5]. Detection sensitivity was stratified by z-phase bin, reconstruction interval (1mm, 3mm, 5mm), and by the ratio of reconstruction interval to nodule diameter (d/D). Results: At 5mm reconstruction interval, sensitivity was 71.6% vs 84.8% at 1mm baseline. Within the 5mm condition, sensitivity varied by 17.6 percentage points across z-phase bins. Stratified by d/D ratio, sensitivity was 92.4% for d/D < 0.5, 78.0% for 0.5 <= d/D < 1.0, and 61.4% for d/D >= 1.0, with a systematic z-phase effect present only in the d/D >= 1.0 stratum. Conclusions: AI detection sensitivity depends on the ratio of reconstruction interval to nodule diameter. When this ratio approaches or exceeds 1.0 -- as occurs for 3-6mm nodules at 5mm reconstruction -- z-phase becomes the dominant source of per-study detection variance. This stochastic effect is invisible to protocol-level quality metrics and not reflected in AI confidence scores.
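The z-phase and d/D definitions in the abstract above can be illustrated with a minimal Python sketch. It assumes the reconstruction grid starts at a hypothetical origin `z_origin_mm` and that folding means mapping a phase p in [0, 1) to min(p, 1 - p). The function names and parameters are illustrative and are not taken from the paper.

```python
def folded_z_phase(z_center_mm: float, interval_mm: float,
                   z_origin_mm: float = 0.0) -> float:
    """Fractional position of a nodule center within one reconstruction
    cycle, folded to [0, 0.5]. The grid origin is an assumed parameter."""
    if interval_mm <= 0:
        raise ValueError("interval_mm must be positive")
    # Fractional position within the cycle, in [0, 1).
    phase = ((z_center_mm - z_origin_mm) / interval_mm) % 1.0
    # Fold: a center at phase p is as far from a slice position as one at 1 - p.
    return min(phase, 1.0 - phase)


def d_over_D_stratum(interval_mm: float, diameter_mm: float) -> str:
    """Assign the d/D stratum using the cut points quoted in the abstract
    (d = reconstruction interval, D = nodule diameter)."""
    ratio = interval_mm / diameter_mm
    if ratio < 0.5:
        return "d/D < 0.5"
    if ratio < 1.0:
        return "0.5 <= d/D < 1.0"
    return "d/D >= 1.0"


if __name__ == "__main__":
    # A 4 mm nodule centered at z = 12 mm, reconstructed at 5 mm intervals.
    print(folded_z_phase(12.0, 5.0))
    print(d_over_D_stratum(5.0, 4.0))
```

Under these assumptions, a 4 mm nodule at a 5 mm interval falls in the d/D >= 1.0 stratum, the one in which the abstract reports a systematic z-phase effect.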

2605.00968 2026-05-05 eess.SP cs.AI

Adaptive 3D-RoPE: Physics-Aligned Rotary Positional Encoding for Wireless Foundation Models

Chenyu Zhang, Xinchen Lyu, Chenshan Ren, Shuhan Liu, Qimei Cui

Comments 13 pages, 7 figures

Details
English abstract

Positional encoding plays a pivotal role in determining the extrapolation and generalization performance of wireless foundation models for channel state information (CSI) modeling, latent characterization, and task-specific prediction. However, existing CSI models inherit static or one-dimensional positional priors from natural language and vision architectures, which fundamentally misalign with the intrinsic physics of wireless channels by lacking explicit relative decay, collapsing the 3D spatio-temporal-frequency structure, and remaining scenario-rigid. This paper proposes Adaptive 3D-RoPE, a physics-aligned rotary positional encoding that establishes the structural cornerstone for wireless foundation models. The framework integrates a learnable, axis-decoupled 3D frequency bank to explicitly disentangle multi-dimensional phase dependencies, coupled with a lightweight channel-conditioned controller that dynamically modulates the prior via compact global CSI descriptors. This sample-adaptive mechanism transforms positional encoding from a static transformer component into a dynamic, coherence-aware inductive bias to resolve heterogeneous channel physics. Extensive experiments across 100 datasets demonstrate the superiority of the proposed scheme in both scale extrapolation and zero-shot generalization. Compared to the state-of-the-art, our method achieves up to a 10.7 dB reduction in normalized mean square error (NMSE) under 8 times antenna scale extrapolation. Given the same CSI input scales, our method can also improve zero-shot NMSE by 1.07 dB across unseen mobility scenarios and 0.90 dB in low-frequency-to-millimeter-wave tasks.

2605.00964 2026-05-05 cs.IR cs.AI cs.HC

Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Lennard C. Froma, Tom Kouwenhoven, Maaike H. T. de Boer, Catholijn M. Jonker, Max J. van Duijn

Details
English abstract

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has lagged behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction alongside accuracy, rather than focusing on benchmark performance only.

2605.00957 2026-05-05 cs.IR cs.AI

"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

Daan Di Scala, Maaike de Boer, Pınar Yolum

Comments To be published in VALE 2025 Proceedings

Details
English abstract

Achieving the right amount of trust in AI systems is important, but challenging. The problem is exacerbated by the rise of Large Language Models (LLMs), as they provide human-level communication capabilities but potentially hallucinate in the content that they generate. Moreover, they express over-confidence in their answers, making it difficult for users to judge their truthfulness. An important human value that users seek is benevolence, which can be met by an LLM's self-reflection leading to reliable and honest answers. Accordingly, this paper proposes conveying appropriate levels of self-reflected certainty to build appropriate trust. Our contributions are twofold: 1) We develop CERTA (Certainty Enhanced RAG for Trustworthy Answers), a specialized Retrieval Augmented Generation (RAG) system that incorporates the relevance between question, context, and answer to reflect its uncertainty in answering questions; 2) We create the Certainty Benchmark with 90 question-context pairs of non-objective questions, divided over four categories (factuality, preference, sycophancy, morality) and three types of contexts (relevant, incomplete, irrelevant). We run experiments with a baseline RAG system and three CERTA settings using two LLMs. Our evaluations indicate that CERTA helps identify answers that are uncertain, decreases the cases of over-agreeing, and provides cautious behavior when prompted for moral judgments.

2605.00955 2026-05-05 cs.CR cs.AI

E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

Zelin Guan, Shengda Zhuo, Zeyan Li, Jinchun He, Wangjie Qiu, Zhiming Zheng, Shuqiang Huang

Details
English abstract

Retrieval-Augmented Generation (RAG) equips large language models (LLMs) with external evidence by retrieving documents at inference time, but it also turns the retrieval corpus into a sensitive asset. Under a black-box setting, an adversary given a candidate document can infer whether it has been ingested into the RAG knowledge base (i.e., document-level membership inference) solely from query-response interactions, thereby leaking corpus coverage and the existence of sensitive topics. Existing RAG MIA methods either rely on soft signals such as semantic similarity, which often yield overlapping member/non-member score distributions and unstable thresholds, or employ explicit confirmation probes whose intent is conspicuous and thus prone to refusal and detection. We propose E-MIA, which converts verifiable hard evidence in the target document (e.g., fine-grained details, proper nouns/technical terms, definitional statements, metadata cues, and causal/constraint relations) into an exam with four objectively gradable question types (FB/SC/MC/T/F), and uses the aggregated exam score across multiple evidence-targeted questions as the membership signal. Experiments across multiple datasets and diverse RAG configurations demonstrate that E-MIA improves member/non-member separability in stringent settings while preserving natural, stealthy queries, and we further analyze the impact of question composition and exam length on attack effectiveness.
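The idea of using an aggregated exam score as the membership signal can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not expand FB/SC/MC/T/F (they are read here as fill-in-the-blank, single-choice, multiple-choice, and true/false), and the grading rule, the unweighted mean, and the `threshold` value are hypothetical choices rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedItem:
    qtype: str      # assumed labels: "FB", "SC", "MC", or "TF"
    expected: str   # answer key derived from the candidate document
    response: str   # answer returned by the black-box RAG system


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def grade(item: GradedItem) -> int:
    """Return 1 for a correct answer, 0 otherwise (hypothetical grading rule)."""
    if item.qtype == "MC":
        # Multiple choice: compare the selected options as sets.
        def as_set(s: str) -> set:
            return set(_norm(s).replace(",", " ").split())
        return int(as_set(item.expected) == as_set(item.response))
    return int(_norm(item.expected) == _norm(item.response))


def exam_score(items: List[GradedItem]) -> float:
    """Fraction of exam questions answered correctly."""
    if not items:
        raise ValueError("exam must contain at least one question")
    return sum(grade(i) for i in items) / len(items)


def infer_membership(items: List[GradedItem], threshold: float = 0.6) -> bool:
    """Flag the document as a likely member when the score clears the
    threshold. The threshold value is an assumption for illustration."""
    return exam_score(items) >= threshold


if __name__ == "__main__":
    exam = [
        GradedItem("FB", "Paris", "paris"),
        GradedItem("TF", "true", "false"),
        GradedItem("MC", "A,C", "c, a"),
        GradedItem("SC", "B", "B"),
    ]
    print(exam_score(exam), infer_membership(exam))
```

Because every question is graded as right or wrong, the score is a discrete count rather than a similarity value, which is the contrast with soft signals that the abstract draws.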

2605.00948 2026-05-05 q-bio.QM cs.AI

Co-Generative De Novo Functional Protein Design

Xinrui Chen, Yizhen Luo, Siqi Fan, Zaiqing Nie

Details
English abstract

De novo functional protein design aims to generate protein sequences that realize specified biochemical functions without relying on evolutionary templates, enabling broad applications in biotechnology and medicine. Existing approaches adopt either direct function-to-sequence mapping or decoupled structure-sequence generation strategies but often fail to achieve functionality and foldability simultaneously. To address this, we propose CodeFP, a Co-generative protein language model for de novo Functional Protein design that simultaneously decodes sequence and structure tokens, thereby enabling superior simultaneous realization of functionality and foldability. CodeFP utilizes functional local structures to enrich functional semantic encodings, overcoming the suboptimal translation of flat encodings into structure tokens, while introducing auxiliary functional supervision to alleviate training ambiguity stemming from the one-to-many structure-to-token mapping. Extensive experiments show that CodeFP consistently achieves average improvements of 6.1% in functional consistency and 3.2% in foldability over the strongest baseline.

2605.00944 2026-05-05 cs.IR cs.AI cs.CL

SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets

Xu Zheng, Feiyu Wu, Linhong Wu, Zhuocheng Wang, Hui Li

Details
English abstract

Sample-level rankings are increasingly used in data-centric NLP for analysis, filtering, debugging, and curation, yet existing pipelines typically score training examples pointwise and rank them as if they were independent. This assumption is fragile in the presence of exact duplicates, near-duplicates, paraphrases, and other redundant structure common in NLP corpora, where stochastic training can make highly similar examples receive unstable relative orderings across random seeds. We study stable sample-level ranking under redundancy and propose SCARV, a modular aggregation framework that operates on top of an existing scoring proxy. SCARV combines robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, several NLP tasks, and end-to-end DistilBERT fine-tuning, SCARV substantially improves over bare proxy rankings in global and local stability and yields more reproducible ranking-based decisions such as subset selection and suspicious-example retrieval. Our decomposition and compute-aware frontier sharpen the mechanism: robust multi-seed aggregation is the dominant generic stabilizer, while the structure-aware component adds value mainly under low aggregation budgets or when redundancy clusters are informative, naturally occurring, or sufficiently covered. These results position SCARV not as a universal data selector or a universally dominant replacement for seed-only aggregation, but as a stability-oriented aggregation layer for proxy-induced rankings in redundant NLP datasets.
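A two-step aggregation of this kind might look like the sketch below. The abstract does not specify the estimators, so the per-sample median across seeds and the per-cluster mean used here are assumptions chosen for illustration, and `aggregate_scores` is a hypothetical name, not SCARV's actual interface.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Hashable, List, Sequence


def aggregate_scores(seed_scores: Sequence[Sequence[float]],
                     cluster_ids: Sequence[Hashable]) -> List[float]:
    """Stabilize per-sample proxy scores in two steps.

    seed_scores: one score list per random seed (shape: seeds x samples).
    cluster_ids: redundancy-cluster label for each sample.
    """
    n = len(cluster_ids)
    if any(len(run) != n for run in seed_scores):
        raise ValueError("every seed must score every sample")

    # Step 1 (assumed estimator): median across seeds damps outlier runs.
    robust = [median(run[i] for run in seed_scores) for i in range(n)]

    # Step 2 (assumed allocation): members of a redundancy cluster share the
    # cluster mean, so near-duplicates cannot swap order between runs.
    members = defaultdict(list)
    for i, cid in enumerate(cluster_ids):
        members[cid].append(i)
    shared = {cid: mean(robust[i] for i in idx) for cid, idx in members.items()}
    return [shared[cid] for cid in cluster_ids]


if __name__ == "__main__":
    # Samples 0 and 1 are near-duplicates whose order flips across seeds.
    scores = [[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.5],
              [0.3, 0.7, 0.5]]
    print(aggregate_scores(scores, cluster_ids=["dup", "dup", "solo"]))
```

Giving cluster members an identical score is the simplest possible allocation. A real system could instead spread scores within a cluster, which is one reading of the "aggregation/allocation" wording in the abstract.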

2605.00942 2026-05-05 cs.SE cs.LG

PPO guided Agentic Pipeline for Adaptive Prompt Selection and Test Case Generation

Gourisetty Venkata Sai Koushik, Dama Aditya, Mahankali Harish Sai, Peddi Siddarhta, Shadab Ahmad, Vivek Yelleti

Details
English abstract

Developing effective test cases capable of thoroughly exercising large-scale software systems is inherently difficult, especially if such systems have voluminous, complex, and deeply nested source code. In this work, we present a novel approach for generating test cases using a reinforcement learning-driven agentic framework where Proximal Policy Optimization (PPO) is coupled with an LLM engine to guide prompt selection during test generation. Our approach consists of two phases. In Phase I, the ToT-guided optimization agent partitions and minimizes the source code by removing redundancies without changing its functional behavior. In Phase II, a PPO-based policy network is trained to select among eight different prompting techniques, such as Boundary Value Analysis and Random Fuzzing, based on an 11-dimensional input state vector representing source code complexity metrics and live coverage metrics, in order to direct the LLM engine towards exploring unvisited paths in the program. The PPO agent receives rewards based on a combination of increases in line and branch coverage, penalties for unexplored branches, and rewards for reducing source code length. In experiments conducted on twenty benchmark programs, the proposed approach, PPO-LLM, outperforms CBMC, kS-LLM, and kS-LLM++ in terms of branch and line coverage in almost all cases, for loop bound values ranging from BOUND 1 to BOUND 2000. At BOUND 1, branch coverage reaches 100% using PPO-LLM on the PALS suite, compared with around 86.8% using kS-LLM++. This confirms that adaptive prompt selection driven by PPO substantially outperforms static prompting strategies on PALS-type programs.
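The reward described in the abstract combines three ingredients: coverage gains, a penalty for unexplored branches, and a bonus for shorter source code. A minimal sketch of one way to combine them is given below. The linear form, the normalizations, and all weight values (`w_line`, `w_branch`, `w_unexplored`, `w_length`) are assumptions for illustration, since the abstract does not give the actual formula.

```python
def step_reward(prev_line_cov: float, line_cov: float,
                prev_branch_cov: float, branch_cov: float,
                unexplored_branches: int, total_branches: int,
                original_loc: int, reduced_loc: int,
                w_line: float = 1.0, w_branch: float = 1.0,
                w_unexplored: float = 0.5, w_length: float = 0.1) -> float:
    """Hypothetical per-step reward; coverages are fractions in [0, 1]."""
    # Reward increases in line and branch coverage since the previous step.
    coverage_gain = (w_line * (line_cov - prev_line_cov)
                     + w_branch * (branch_cov - prev_branch_cov))

    # Penalize the share of branches that no generated test has reached yet.
    unexplored_penalty = 0.0
    if total_branches > 0:
        unexplored_penalty = w_unexplored * unexplored_branches / total_branches

    # Reward the relative reduction in source length (never negative).
    length_bonus = 0.0
    if original_loc > 0:
        length_bonus = w_length * max(0, original_loc - reduced_loc) / original_loc

    return coverage_gain - unexplored_penalty + length_bonus


if __name__ == "__main__":
    r = step_reward(prev_line_cov=0.5, line_cov=0.7,
                    prev_branch_cov=0.4, branch_cov=0.6,
                    unexplored_branches=2, total_branches=10,
                    original_loc=100, reduced_loc=80)
    print(round(r, 4))
```

In a full pipeline, a scalar of this kind would be returned to the PPO policy after each prompt choice, with the 11-dimensional state vector mentioned in the abstract as the observation.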