arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.19851 2026-04-23 cs.GT cs.AI

Is Four Enough? Automated Reasoning Approaches and Dual Bounds for Condorcet Dimensions of Elections

Itai Zilberstein, Ratip Emin Berker, George Li, Ruben Martins

Comments: Appears at the 8th Games, Agents, and Incentives Workshop (GAIW-26). Held as part of the Workshops at the 25th International Conference on Autonomous Agents and Multiagent Systems

Abstract

In an election where $n$ voters rank $m$ candidates, a Condorcet winning set is a committee of $k$ candidates such that for any outside candidate, a majority of voters prefer some committee member. Condorcet's paradox shows that some elections admit no Condorcet winning sets with a single candidate (i.e., $k=1$), and the same can be shown for $k=2$. On the other hand, recent work proves that a set of size $k=5$ exists for every election. This leaves an important theoretical gap between the best known lower bound $(k\geq 3)$ and upper bound $(k \leq 5)$ for the number of candidates needed to guarantee existence. We aim to close the gap between the existence guarantees and impossibility results for Condorcet winning sets. We explore an automated reasoning approach to tighten these bounds. We design a mixed-integer linear program (MILP) to search for elections that would serve as counter-examples to conjectured bounds. We employ a number of optimizations, such as symmetry breaking, subsampling, and constraint generation, to enhance the search and model effectively infinite electorates. Furthermore, we analyze the dual of the linear programming relaxation as a path towards obtaining a new upper bound. Despite extensive search on moderate-sized elections, we fail to find any election requiring a committee larger than size 3. Motivated by our experimental results in this direction, we simplify the dual linear program and formulate a conjecture which, if true, implies that a winning set of size 4 always exists. Our automated reasoning results provide strong empirical evidence that the Condorcet dimension of any election may be smaller than currently known upper bounds, at least for small instances. We offer a general-purpose framework for searching elections in ranked voting and a new, concrete analytical path via duality toward proving that smaller committees suffice.
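
The definition above can be made concrete with a brute-force check. The following is a minimal sketch, not the authors' MILP: it assumes "a majority" means strictly more than half of the voters, and the function names and the three-voter profile are hypothetical illustrations.

```python
from itertools import combinations


def is_condorcet_winning_set(profile, committee):
    """profile: list of rankings, each a list of candidates, best first."""
    committee = set(committee)
    # Position of every candidate in every voter's ranking (lower is better).
    positions = [{c: i for i, c in enumerate(ranking)} for ranking in profile]
    outsiders = set(profile[0]) - committee
    for outsider in outsiders:
        # Voters who rank at least one committee member above the outsider.
        supporters = sum(
            1
            for pos in positions
            if any(pos[c] < pos[outsider] for c in committee)
        )
        # Assumption: a strict majority (more than half) is required.
        if 2 * supporters <= len(profile):
            return False
    return True


def condorcet_dimension(profile):
    """Smallest k admitting a Condorcet winning set, found by exhaustive search."""
    candidates = sorted(profile[0])
    for k in range(1, len(candidates) + 1):
        for committee in combinations(candidates, k):
            if is_condorcet_winning_set(profile, committee):
                return k, committee


# Classic Condorcet cycle: no single candidate works, but a pair does.
cycle = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
dimension, committee = condorcet_dimension(cycle)
```

Exhaustive search like this is only feasible for tiny elections, which is presumably why the abstract turns to MILP search with symmetry breaking and constraint generation.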

2604.19850 2026-04-23 cs.ET cs.LG cs.NE q-bio.MN q-bio.QM

What Makes a Bacterial Model a Good Reservoir Computer? Predicting Performance from Separability and Similarity

Laura Alonso Bartolomé, Jean-Loup Faulon, Xavier Hinaut

Abstract

Biological systems are promising substrates for computation because they naturally process environmental information through complex internal dynamics. In this study, we investigate whether bacterial metabolic models can act as physical reservoirs and whether their computational performance can be predicted from dynamical properties linked to separability and similarity. We simulated the growth dynamics of five bacterial species, one yeast species, and 29 Escherichia coli single-gene deletion mutants using dynamic flux balance analysis (dFBA), with glucose and xylose concentrations as inputs and growth curves as reservoir states. Computational performance was assessed on random nonlinear classification tasks using a linear readout, while reservoir properties linked to separability and similarity were characterised through kernel and generalisation ranks computed from growth-curve state matrices. Several microbial models achieved high classification accuracy, showing that bacterial metabolic dynamics can support nonlinear computation. Clear differences were observed between species, with some models converging more rapidly and others reaching higher maximum accuracy, revealing a trade-off between convergence speed and peak performance. In contrast, all E. coli mutants were dominated by the wild-type model, suggesting that gene deletions reduce the dynamical richness required for efficient computation. The difference between kernel and generalisation ranks was generally associated with improved accuracy, but deviations across models and sensitivity at low rank values limited its predictive power in practice. Overall, these results show that bacterial metabolic models constitute promising substrates for reservoir computing and provide a first step towards identifying microbial strains with favourable computational properties for future experimental implementations.

2604.19846 2026-04-23 hep-ex astro-ph.HE astro-ph.IM cs.AI cs.LG

Neural posterior estimation of the neutrino direction in IceCube using transformer-encoded normalizing flows on the sphere

R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers, J. M. Alameddine, S. Ali, N. M. Amin, K. Andeen, C. Argüelles, Y. Ashida, S. Athanasiadou, S. N. Axani, R. Babu, X. Bai, A. Balagopal V., S. W. Barwick, V. Basu, R. Bay, J. J. Beatty, J. Becker Tjus, P. Behrens, J. Beise, C. Bellenghi, S. Benkel, S. BenZvi, D. Berley, E. Bernardini, D. Z. Besson, E. Blaufuss, L. Bloom, S. Blot, F. Bontempo, J. Y. Book Motzkin, C. Boscolo Meneguolo, S. Böser, O. Botner, J. Böttcher, J. Braun, B. Brinson, Z. Brisson-Tsavoussis, R. T. Burley, D. Butterfield, K. Carloni, J. Carpio, N. Chau, Z. Chen, D. Chirkin, S. Choi, A. Chubarov, B. A. Clark, G. H. Collin, D. A. Coloma Borja, A. Connolly, J. M. Conrad, D. F. Cowen, C. De Clercq, J. J. DeLaunay, D. Delgado, T. Delmeulle, S. Deng, P. Desiati, K. D. de Vries, G. de Wasseige, T. DeYoung, J. C. Díaz-Vélez, S. DiKerby, T. Ding, M. Dittmer, A. Domi, L. Draper, L. Dueser, D. Durnford, K. Dutta, M. A. DuVernois, T. Ehrhardt, L. Eidenschink, A. Eimer, C. Eldridge, P. Eller, E. Ellinger, D. Elsässer, R. Engel, H. Erpenbeck, W. Esmail, S. Eulig, J. Evans, P. A. Evenson, K. L. Fan, K. Fang, K. Farrag, A. R. Fazely, A. Fedynitch, N. Feigl, C. Finley, D. Fox, A. Franckowiak, S. Fukami, P. Fürst, J. Gallagher, E. Ganster, A. Garcia, M. Garcia, E. Genton, L. Gerhardt, A. Ghadimi, C. Glaser, T. Glüsenkamp, J. G. Gonzalez, S. Goswami, A. Granados, D. Grant, S. J. Gray, S. Griffin, K. M. Groth, D. Guevel, C. Günther, P. Gutjahr, C. Ha, A. Hallgren, L. Halve, F. Halzen, L. Hamacher, M. Handt, K. Hanson, J. Hardin, A. A. Harnisch, P. Hatch, A. Haungs, J. Häußler, K. Helbing, J. Hellrung, B. Henke, L. Hennig, F. Henningsen, L. Heuermann, R. Hewett, N. Heyer, S. Hickford, A. Hidvegi, C. Hill, G. C. Hill, R. Hmaid, K. D. Hoffman, A. Hollnagel, D. Hooper, S. Hori, K. Hoshina, M. Hostert, W. Hou, M. Hrywniak, T. Huber, K. Hultqvist, K. Hymon, A. Ishihara, W. Iwakiri, M. Jacquart, S. Jain, O. Janik, M. Jansson, M. Jin, N. Kamp, D. Kang, W. Kang, A. 
Kappes, L. Kardum, T. Karg, A. Karle, A. Katil, M. Kauer, J. L. Kelley, M. Khanal, A. Khatee Zathul, A. Kheirandish, T. Kim, H. Kimku, F. Kirchner, J. Kiryluk, C. Klein, S. R. Klein, Y. Kobayashi, S. Koch, A. Kochocki, R. Koirala, H. Kolanoski, T. Kontrimas, L. Köpke, C. Kopper, D. J. Koskinen, P. Koundal, M. Kowalski, T. Kozynets, A. Kravka, N. Krieger, T. Krishnan, K. Kruiswijk, E. Krupczak, A. Kumar, E. Kun, N. Kurahashi, C. Lagunas Gualda, L. Lallement Arnaud, M. J. Larson, F. Lauber, J. P. Lazar, K. Leonard DeHolton, A. Leszczyńska, C. Li, J. Liao, C. Lin, Q. R. Liu, Y. T. Liu, M. Liubarska, C. Love, L. Lu, F. Lucarelli, W. Luszczak, Y. Lyu, M. Macdonald, E. Magnus, Y. Makino, E. Manao, S. Mancina, A. Mand, I. C. Mariş, S. Marka, Z. Marka, L. Marten, I. Martinez-Soler, R. Maruyama, J. Mauro, F. Mayhew, F. McNally, K. Meagher, A. Medina, M. Meier, Y. Merckx, L. Merten, J. Mitchell, L. Molchany, S. Mondal, T. Montaruli, R. W. Moore, Y. Morii, A. Mosbrugger, D. Mousadi, E. Moyaux, T. Mukherjee, M. Nakos, U. Naumann, J. Necker, L. Neste, M. Neumann, H. Niederhausen, M. U. Nisa, K. Noda, A. Noell, A. Novikov, A. Obertacke, V. O'Dell, A. Olivas, R. Orsoe, J. Osborn, E. O'Sullivan, B. Owens, V. Palusova, H. Pandya, A. Parenti, N. Park, V. Parrish, E. N. Paudel, L. Paul, C. Pérez de los Heros, T. Pernice, T. C. Petersen, J. Peterson, S. Pick, M. Plum, A. Pontén, V. Poojyam, B. Pries, R. Procter-Murphy, G. T. Przybylski, L. Pyras, C. Raab, J. Rack-Helleis, N. Rad, M. Ravn, K. Rawlins, Z. Rechav, A. Rehman, I. Reistroffer, E. Resconi, S. Reusch, C. D. Rho, W. Rhode, L. Ricca, B. Riedel, A. Rifaie, E. J. Roberts, S. Rodan, M. Rongen, A. Rosted, C. Rott, T. Ruhe, L. Ruohan, D. Ryckbosch, J. Saffer, D. Salazar-Gallegos, P. Sampathkumar, A. Sandrock, G. Sanger-Johnson, M. Santander, S. Sarkar, M. Scarnera, M. Schaufel, H. Schieler, S. Schindler, L. Schlickmann, B. Schlüter, F. Schlüter, N. Schmeisser, T. Schmidt, A. Scholz, F. G. Schröder, S. Schwirn, S. Sclafani, D. 
Seckel, L. Seen, M. Seikh, S. Seunarine, P. A. Sevle Myhr, R. Shah, S. Shah, S. Shefali, N. Shimizu, B. Skrzypek, R. Snihur, J. Soedingrekso, D. Soldin, P. Soldin, G. Sommani, C. Spannfellner, G. M. Spiczak, C. Spiering, J. Stachurska, M. Stamatikos, T. Stanev, T. Stezelberger, T. Stürwald, T. Stuttard, G. W. Sullivan, I. Taboada, S. Ter-Antonyan, A. Terliuk, A. Thakuri, M. Thiesmeyer, W. G. Thompson, J. Thwaites, S. Tilav, K. Tollefson, J. A. Torres, S. Toscano, D. Tosi, K. Upshaw, A. Vaidyanathan, N. Valtonen-Mattila, J. Valverde, J. Vandenbroucke, T. Van Eeden, N. van Eijndhoven, L. Van Rootselaar, J. van Santen, J. Vara, F. Varsi, M. Venugopal, M. Vereecken, S. Vergara Carrasco, S. Verpoest, D. Veske, A. Vijai, J. Villarreal, C. Walck, A. Wang, E. H. S. Warrick, C. Weaver, P. Weigel, A. Weindl, J. Weldert, A. Y. Wen, C. Wendt, J. Werthebach, M. Weyrauch, N. Whitehorn, C. H. Wiebusch, D. R. Williams, L. Witthaus, G. Wrede, X. W. Xu, J. P. Yanez, Y. Yao, E. Yildizci, S. Yoshida, R. Young, F. Yu, S. Yu, T. Yuan, S. Yun-Cárcamo, A. Zander Jurowitzki, A. Zegarelli, S. Zhang, Z. Zhang, P. Zhelnin, P. Zilberman

Abstract

IceCube is a cubic-kilometer-scale neutrino detector located at the geographic South Pole. A precise directional reconstruction of IceCube neutrinos is vital for associations with astronomical objects. In this context, we discuss neural posterior estimation of the neutrino direction via a transformer encoder that maps to a normalizing flow on the 2-sphere. It achieves a new state-of-the-art angular resolution for the two main event morphologies in IceCube - tracks and showers - while being significantly faster than traditional B-spline-based likelihood reconstructions. All-sky scans can be performed within seconds rather than hours, and take constant computation time, regardless of whether the posterior extent is arc-minutes or spans the whole sky. We utilize a combination of $C^2$-smooth rational-quadratic splines, scale transformations and rotations to define a novel spherical normalizing-flow distribution whose parameters are predicted as a whole as the output of the transformer encoder. We test several structural choices deviating from the vanilla transformer architecture. In particular, we find dual residual streams, nonlinear QKV projection and a separate class token with its own cross-attention processing to boost test-time performance. The angular resolution for both showers and tracks improves substantially over the whole trained energy range from 100 GeV to 100 PeV. At 100 TeV deposited energy, for example, the median angular resolution improves by a factor of $1.3$ for throughgoing tracks, by a factor of $1.7$ for showers and by a factor of $2.5$ for starting tracks compared to state-of-the-art likelihood reconstructions based on B-splines. While previous machine-learning (ML) efforts have managed to obtain competitive shower resolutions, this is the first time an ML-based method outperforms likelihood-based muon reconstructions above 100 GeV.

2604.19841 2026-04-23 stat.AP cs.LG

Spatio-temporal modelling of electric vehicle charging demand

Kaoutar Bouaachra, Yvenn Amara-Ouali, Yannig Goude, Raphaël Lachieze-Rey

Comments: 18 pages, 19 figures

Abstract

Accurate forecasting of electric vehicle (EV) charging demand is critical for grid management and infrastructure planning. Yet the field continues to rely on legacy benchmarks, such as the Palo Alto (2020) dataset, that fail to reflect the scale and behavioral diversity of modern charging networks. To address this, we introduce a novel large-scale longitudinal dataset collected across Scotland (2022-2025), which we release as an open benchmark for the community. Building on this dataset, we formulate EV charging demand as a spatio-temporal latent Gaussian field and perform approximate Bayesian inference via Integrated Nested Laplace Approximation (INLA). The resulting model jointly captures spatial dependence, temporal dynamics, and covariate effects within a unified probabilistic framework. On station-level forecasting tasks, our approach achieves competitive predictive accuracy against machine learning baselines, while additionally providing principled uncertainty quantification and interpretable spatial and temporal decompositions, properties that are essential for risk-aware infrastructure planning.

2604.19832 2026-04-23 quant-ph cs.LG

Option Pricing on Noisy Intermediate-Scale Quantum Computers: A Quantum Neural Network Approach

Sebastian Zając, Rafał Pracht

Abstract

In a global derivatives market with notional values in the hundreds of trillions of dollars, the accuracy and efficiency of pricing models are of fundamental importance, with direct implications for risk management, capital allocation, and regulatory compliance. In this work, we employ the Black-Scholes-Merton (BSM) framework not as an end in itself, but as a controlled benchmark environment in which to rigorously assess the capabilities of quantum machine learning methods. We propose a fully quantum approach to option pricing based on Quantum Neural Networks (QNNs), and, to the best of our knowledge, present one of the first implementations of such a methodology on currently available quantum hardware. Specifically, we investigate whether QNNs, by exploiting the geometric structure of Hilbert space, can effectively approximate option pricing functions. Our implementation utilizes a compact 2-qubit QNN architecture evaluated across multiple state-of-the-art quantum processors, including IBM Fez, IQM Garnet, IonQ Forte, and Rigetti Ankaa-3. This cross-platform study reveals distinct hardware-dependent performance characteristics while demonstrating that accurate pricing approximations can be achieved consistently across different devices despite the constraints of Noisy Intermediate-Scale Quantum (NISQ) hardware. The results provide empirical evidence that QNN-based approaches constitute a viable framework for derivative pricing. While the analysis is conducted within the BSM setting, the broader significance lies in the potential extension of these methods to more realistic and computationally demanding models, including local volatility, stochastic volatility, and interest rate frameworks commonly used in practice.
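
For orientation, the BSM benchmark mentioned above has a closed-form price for European options, which is the kind of target a learned approximator can be compared against. The sketch below is the standard textbook formula, not the authors' code; the abstract does not say which option types or parameter ranges were used, so the function names and the sample parameters are illustrative assumptions.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call_price(spot, strike, rate, vol, maturity):
    """European call price under Black-Scholes-Merton (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def bsm_put_price(spot, strike, rate, vol, maturity):
    """European put price obtained from the call via put-call parity."""
    call = bsm_call_price(spot, strike, rate, vol, maturity)
    return call - spot + strike * math.exp(-rate * maturity)


# Illustrative parameters (assumed, not from the paper): at-the-money, one year.
call = bsm_call_price(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
put = bsm_put_price(spot=100.0, strike=100.0, rate=0.05, vol=0.2, maturity=1.0)
```

Because the reference price is exact, any approximation error measured against it can be attributed to the model and the hardware rather than to the benchmark.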

2604.19827 2026-04-23 cs.SE cs.AI

More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

Daniel Russo

Abstract

Software engineering faces a fundamental challenge: multi-agent AI systems fail in ways that defy explanation by traditional theories. While individual agents perform correctly, their interactions degrade entire ecosystems, revealing a gap in our understanding of software evolution. This paper argues that AI-native software ecosystems must be studied as complex adaptive systems (CAS), where emergent properties like architectural entropy, cascade failures, and comprehension debt arise not from individual components, but from their interactions. We map Holland's six CAS properties onto observable ecosystem dynamics, distinguishing these systems from microservices or open-source networks. To measure causal emergence, we define micro-level state variables, coarse-graining functions, and a tractable measurement framework. Seven falsifiable propositions link CAS theory to software evolution, challenging or extending Lehman's laws where agent-level assumptions fail. If confirmed, these findings would demand a radical shift: ecosystem-level monitoring as the primary governance mechanism for AI-native systems. If refuted, existing theories may only need incremental updates. Either way, this work forces us to ask: Can software engineering's core assumptions survive the age of autonomous agents?

2604.19826 2026-04-23 cs.SE cs.AI cs.LG

Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

Éric Jacopin

Comments: 20 pages. Preprint; arXiv long version of a paper accepted at AIware 2026. Adds Appendices A (cross-language) and B (Python isolation) not present in the ACM camera-ready

Abstract

AI coding assistants increasingly generate code alongside tests. How developers structure test code, whether inline with the implementation or in separate blocks, has traditionally been a matter of testing philosophy. We investigate whether this choice affects AI code generation quality. We conduct a large-scale empirical study (830+ generated files, 12 models, 3 providers) using SEGA, a three-dimensional evaluation framework measuring Determinism, Preservation, and Correctness. Comparing inline test syntax (Python doctests) against separated test syntax (Rust #[test] blocks) on a d-ary heap implementation, we find that: (1) inline tests yield near-perfect preservation (100%) and correctness (92-100%) across all models; (2) separated tests expose stark model-tier gaps (0-100% correctness) and independence between preservation and correctness; (3) model behavior evolves across generations, and notably one model breaks the test suppression pattern of its three predecessors; (4) mechanistic analysis on 7 open-source architectures (6 transformers and a gated-linear Recurrent Neural Network (RNN)) reveals inline test markers receive 2.8-4.4$\times$ stronger attention in 5/7 models, with causal validation via knockout and steering experiments on the 4 code-specialized transformers and RWKV-6; the co-location mechanism extends to a non-transformer architecture, suggesting the design recommendation is robust to future architectural shifts. In the Foundation Model era, test syntax structure is a software design concern: co-locating tests with implementation code produces measurably better AI-generated code. This arXiv long version includes appendices that further qualify the effect as bounded by both model capability and programming language.
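
To make "inline test syntax" concrete, here is a minimal sketch of a Python doctest co-located with the code it checks. It is not taken from the paper's d-ary heap implementation; `parent_index` and `run_inline_tests` are hypothetical helpers, and only standard `doctest` calls are used.

```python
import doctest


def parent_index(i, d):
    """Return the parent index of node i in a 0-indexed d-ary heap.

    The examples below are the inline tests: they live in the docstring,
    next to the implementation they exercise.

    >>> parent_index(4, 3)
    1
    >>> parent_index(1, 3)
    0
    """
    if i <= 0:
        raise ValueError("the root has no parent")
    return (i - 1) // d


def run_inline_tests(func):
    """Run the doctest examples embedded in func's docstring."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # TestResults with .failed and .attempted


result = run_inline_tests(parent_index)
```

In the separated style the abstract contrasts this with, the same two checks would sit in a distinct test block or file, away from the function body.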

2604.19825 2026-04-23 cs.SE cs.AI

SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

Woojin Lee, Jin-Xia Huang

Comments: 23 pages, 2 figures, Accepted at Findings of ACL 2026

Abstract

State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.

2604.19806 2026-04-23 physics.chem-ph cs.AI cs.LG

Improving Molecular Force Fields with Minimal Temporal Information

Ali Mollahosseini, Mohammed Haroon Dupty, Wee Sun Lee

Abstract

Accurate prediction of energy and forces for 3D molecular systems is one of the fundamental challenges at the core of AI for Science applications. Many powerful and data-efficient neural networks predict molecular energies and forces from single atomic configurations. However, one crucial aspect of the data generation process is rarely considered when training these models: Molecular Dynamics (MD) simulation. MD simulations generate time-ordered trajectories of atomic positions that fluctuate in energy and explore regions of the potential energy surface (e.g., under standard NVE/NVT ensembles), rather than being constructed to steadily lower the potential energy toward a minimum as in geometry relaxations. This work explores a novel way to leverage MD data, when available, to improve the performance of such predictors. We introduce a novel training strategy called FRAMES that uses an auxiliary loss function to exploit the temporal relationships within MD trajectories. Counter-intuitively, on two atomistic benchmarks and a synthetic system, we observe that minimal temporal information, captured by pairs of just two consecutive frames, is often sufficient to obtain the best performance, while adding longer trajectory sequences can introduce redundancy and degrade performance. On the widely used MD17 and ISO17 benchmarks, FRAMES significantly outperforms its Equiformer baseline, achieving highly competitive results in both energy and force accuracy. Our work not only presents a novel training strategy that improves the accuracy of the model, but also provides evidence that, for distilling physical priors of atomic systems, more temporal data is not always better.

2604.19801 2026-04-23 eess.AS cs.AI cs.CL

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

Comments: Submitted for Interspeech 2026, currently under review

Abstract

Automatic Speech Recognition (ASR) is increasingly used in applications involving child speech, such as language learning and literacy acquisition. However, the effectiveness of such applications is limited by high ASR error rates. The negative effects can be mitigated by identifying in advance which ASR-outputs are reliable. This work aims to develop two novel approaches for selecting reliable ASR-output at the utterance level, one for read speech and one for dialogue speech material. Evaluations were done on an English and a Dutch dataset, each with a baseline and finetuned model. The results show that utterance-level selection methods for identifying reliably transcribed speech recordings have high precision for the best strategy (P > 97.4) for both read speech and dialogue material, for both languages. Using the current optimal strategy allows 21.0% to 55.9% of dialogue/read speech datasets to be automatically selected with low error rates (UER < 2.6).

2604.19799 2026-04-23 cs.HC cs.AI cs.CY q-bio.NC

Measuring Creativity in the Age of Generative AI: Distinguishing Human and AI-Generated Creative Performance in Hiring and Talent Systems

Yigal Rosen, Ilia Rushkin

Comments: Research Paper Presented at the BIG.AI@MIT Conference, April 2, 2026

Abstract

Generative AI is rapidly transforming how organizations create value and evaluate talent. While large language models enhance baseline output quality, they simultaneously introduce ambiguity in assessing human creativity, as observable artifacts may be partially or fully AI-generated. This paper reconceptualizes creativity as a distributional and process-based property that emerges under shared constraints and competitive incentives. We introduce a quantitative framework for measuring creativity as novelty in synthesis, operationalized through idea generation and idea transformation within embedding space. Empirical evaluation demonstrates that the proposed metrics align with intuitive judgments of creativity while capturing distinctions that surface-level quality assessments miss. We further identify a structural shift toward bimodal distributions of creative output in AI-mediated environments, with implications for hiring, leadership, and competitive strategy. The findings suggest that in the age of generative AI, distinctiveness rather than fluency becomes the primary signal of human creative capability.

2604.19798 2026-04-23 cs.CY cs.CV econ.EM

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

Xinxin Zhuo, Mengyuan Niu, Ruizhe Wang, Junyan Yang, Qiao Wang

Comments: Submitted to ACM Transactions on Spatial Computing. This paper is currently under review

Abstract

Micro-scale street-level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual-semantic and field-based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual-stage VLM-LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location-Based Services (LBS) data to capture realized demand. Combined with a category-weighted Gaussian spillover model, we construct a three-dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time-lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall-induced externalities. High-quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence-based support for precision spatial governance.

2604.19797 2026-04-23 eess.AS cs.AI cs.CL

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

Sri Charan Devarakonda, Ravi Sastry Kolluru, Manjula Sri Rayudu, Rashmi Kapoor, Madhu G, Anil Kumar Vuppala

Abstract

Automatic Speech Recognition (ASR) for low-resource Dravidian languages like Telugu and Kannada faces significant challenges in specialized medical domains due to limited annotated data and morphological complexity. This work proposes a novel confidence-aware training framework that integrates real and synthetic speech data through a hybrid confidence mechanism combining static perceptual and acoustic similarity metrics with dynamic model entropy. Unlike direct fine-tuning approaches, the proposed methodology employs both fixed-weight and learnable-weight confidence aggregation strategies to guide sample weighting during training, enabling effective utilization of heterogeneous data sources. The framework is evaluated on Telugu and Kannada medical datasets containing both real recordings and TTS-generated synthetic speech. A 5-gram KenLM language model is applied for post-decoding correction. Results show that the hybrid confidence-aware approach with learnable weights substantially reduces recognition errors: Telugu Word Error Rate (WER) decreases from 24.3% to 15.8% (8.5% absolute improvement), while Kannada WER drops from 31.7% to 25.4% (6.3% absolute improvement), both significantly outperforming standard fine-tuning baselines. These findings confirm that combining adaptive confidence-aware training with statistical language modeling delivers superior performance for domain-specific ASR in morphologically complex Dravidian languages.
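
As a reminder of what the reported numbers measure, WER is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a generic implementation, not the authors' evaluation code; the example sentences are invented, and `relative_reduction` is a hypothetical helper that re-expresses the reported absolute drops in relative terms.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    cur[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Invented example: one substitution and one insertion against 4 reference words.
wer = word_error_rate("the patient has fever", "the patient had a fever")

# WER figures quoted in the abstract (percent).
telugu_abs = 24.3 - 15.8
kannada_abs = 31.7 - 25.4
telugu_rel = relative_reduction(24.3, 15.8)
kannada_rel = relative_reduction(31.7, 25.4)
```

Reporting the relative figure alongside the absolute one helps when comparing languages whose baselines differ, as Telugu and Kannada do here.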

2604.19781 2026-04-23 cs.CY cs.AI cs.CL

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

Tyler Burleigh

Comments: 12 pages, 7 figures. Accepted at NCME 2026

Abstract

Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
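
The routing idea can be sketched in a few lines: the small model scores every item and reports a confidence, and only items below a threshold are escalated. This is a hedged illustration under assumed interfaces, not the paper's pipeline; `cascade_score`, the toy models, and the threshold value are all hypothetical.

```python
def cascade_score(items, small_model, large_model, threshold):
    """Score items with the small model, escalating low-confidence cases.

    small_model(item) -> (score, confidence in [0, 1])  (assumed interface)
    large_model(item) -> score                           (assumed interface)
    Returns the final scores and the fraction of items escalated.
    """
    scores = []
    escalated = 0
    for item in items:
        score, confidence = small_model(item)
        if confidence < threshold:
            # The small model is unsure: pay for the larger model on this item only.
            score = large_model(item)
            escalated += 1
        scores.append(score)
    rate = escalated / len(items) if items else 0.0
    return scores, rate


# Toy stand-ins for real LMs: short responses are "easy", long ones are not.
def toy_small(item):
    return (1, 0.9) if len(item) < 10 else (0, 0.4)


def toy_large(item):
    return 1


scores, escalation_rate = cascade_score(
    ["2+2=4", "a long ambiguous answer"], toy_small, toy_large, threshold=0.7
)
```

The threshold is the knob that trades cost for accuracy; as the abstract notes, it only helps when the small model's confidence actually separates easy from hard cases.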

2604.19763 2026-04-23 eess.AS cs.AI cs.CL

Explainable Speech Emotion Recognition: Weighted Attribute Fairness to Model Demographic Contributions to Social Bias

Tomisin Ogunnubi, Yupei Li, Björn Schuller

Comments: 5 pages, 4 figures

Abstract

Speech Emotion Recognition (SER) systems have growing applications in sensitive domains such as mental health and education, where biased predictions can cause harm. Traditional fairness metrics, such as Equalised Odds and Demographic Parity, often overlook the joint dependency between demographic attributes and model predictions. We propose a fairness modelling approach for SER that explicitly captures allocative bias by learning the joint relationship between demographic attributes and model error. We validate our fairness metric on synthetic data, then apply it to evaluate HuBERT and WavLM models finetuned on the CREMA-D dataset. Our results indicate that the proposed fairness model captures more mutual information between protected attributes and biases and quantifies the absolute contribution of individual attributes to bias in SSL-based SER models. Additionally, our analysis reveals indications of gender bias in both HuBERT and WavLM.

2604.19752 2026-04-23 cs.MA cs.AI cs.CY

Soft-Label Governance for Distributional Safety in Multi-Agent Systems

Aizierjiang Aiersilan, Raeli Savitt

Abstract

Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels $p = P(v{=}+1) \in [0,1]$, enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity $\mathbb{E}[1{-}p \mid \text{accepted}]$ and quality gap $\mathbb{E}[p \mid \text{accepted}] - \mathbb{E}[p \mid \text{rejected}]$. Across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of $+262$ down to $-67$, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show soft metrics detect proxy gaming by self-optimizing agents passing conventional binary evaluations. This basic governance layer applies to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without modification. Results show that distributional safety requires continuous risk metrics and that governance lever calibration involves quantifiable safety-welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm-ai.org/.
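
The two probabilistic metrics named in the abstract, expected toxicity E[1 - p | accepted] and quality gap E[p | accepted] - E[p | rejected], reduce to conditional averages over soft labels. The sketch below estimates them empirically; it is not SWARM's implementation, and the data layout (pairs of soft label and acceptance flag) and the sample values are assumptions made for illustration.

```python
def _mean(values):
    """Arithmetic mean; raises if the group is empty, since the metric is undefined."""
    if not values:
        raise ValueError("metric undefined for an empty group")
    return sum(values) / len(values)


def expected_toxicity(interactions):
    """Empirical E[1 - p | accepted] over (p, accepted) pairs."""
    return _mean([1.0 - p for p, accepted in interactions if accepted])


def quality_gap(interactions):
    """Empirical E[p | accepted] - E[p | rejected] over (p, accepted) pairs."""
    accepted = [p for p, was_accepted in interactions if was_accepted]
    rejected = [p for p, was_accepted in interactions if not was_accepted]
    return _mean(accepted) - _mean(rejected)


# Invented sample: p is the soft probability that an interaction is beneficial.
sample = [(0.9, True), (0.7, True), (0.2, False), (0.4, False)]
toxicity = expected_toxicity(sample)
gap = quality_gap(sample)
```

A positive quality gap means the acceptance mechanism tends to admit higher-quality interactions than it rejects, while expected toxicity tracks how much likely harm still passes through.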

2604.19750 2026-04-23 cs.SE cs.AI cs.HC

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

Abstract

Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g., command-line outputs) for multi-round debugging and struggle with graphical user interface (GUI) programs that involve visual information. This is mainly due to two limitations: 1) GUI programs are event-driven, yet existing methods cannot simulate user interactions to trigger GUI element logic; 2) GUI programs possess visual attributes, making it difficult for text-based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real-world desktop GUI application tasks designed for fine-grained evaluation of both interaction logic and visual structure. Furthermore, we propose VF-Coder, a vision-feedback-based multi-agent system for debugging GUI code. By perceiving visual information and directly interacting with program interfaces, VF-Coder can identify potential logic and layout issues in a human-like manner. On InteractGUI Bench, our VF-Coder approach increases the success rate of Gemini-3-Flash from 21.68% to 28.29% and raises the visual score from 0.4284 to 0.5584, indicating the effectiveness of visual feedback in GUI debugging.

2604.18951 2026-04-23 cs.MA cs.CL

Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

Namyoung So, Seokgyu Jang, Taeuk Kim

Comments 27 pages, 4 figures. Equal contribution for the first two authors

Abstract

Adaptive multi-agent systems (MAS) are increasingly adopted to tackle complex problems. However, the narrow task coverage of their optimization raises the question of whether they can function as general-purpose systems. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting -- they fail to generalize across different domains; and (2) illusory coordination -- they achieve reasonable surface-level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical utility. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final-answer correctness.

2604.17511 2026-04-23 cs.LO cs.AI cs.CR

Atomic Decision Boundaries: A Structural Requirement for Guaranteeing Execution-Time Admissibility in Autonomous Systems

Marcelo Fernandez

Comments 21 pages. 1st paper (Paper 0) in the 6-paper Agent Governance Series (Papers 0-5). Zenodo: https://doi.org/10.5281/zenodo.19670649. Companion: P1/ACP (arXiv:2603.18829), P2/IML (arXiv:2604.17517), P3 (zenodo.19672597), P4 (zenodo.19672608), P5/RAM (zenodo.19669430)

Abstract

Autonomous systems increasingly execute actions that directly modify shared state, creating an urgent need for precise control over which transitions are permitted to occur. Existing governance mechanisms evaluate policies prior to execution or reconstruct behavior post hoc, but do not enforce admissibility at the exact moment a state transition is committed. We introduce the atomic decision boundary, a structural property of admission control systems in which the decision and the resulting state transition are jointly determined as a single indivisible step in the labeled transition system (LTS) model of execution. We distinguish two classes: atomic systems, where evaluation and transition are coupled within a single LTS step, and split evaluation systems, where they are separate transitions interleaved by environmental actions. The separation introduces an architectural gap -- the decision is evaluated in one system state; the transition fires in a potentially different one -- that no policy, regardless of sophistication, can close from within a split architecture. Under realistic concurrent environments, we prove via a constructive counterexample trace that no construction can make a split system equivalent to an atomic system with respect to admissibility. Three corollaries follow: impossibility of execution-time guarantees in split systems, insufficiency of external state enrichment, and admissibility as an execution-time rather than evaluation-time property. We further formalize the Escalate outcome -- absent from classical TOCTOU analyses -- proving that it transfers rather than eliminates the atomicity requirement: resolution is safe if and only if it is itself atomic. We classify RBAC, ABAC, OPA, Cedar, and AWS IAM as split systems and ACP as atomic, providing a structural taxonomy of existing governance mechanisms. Admissibility is a property of execution, not evaluation.
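
The gap between split and atomic evaluation can be illustrated with a toy example. This is a minimal single-threaded sketch, assuming a hypothetical shared account and a "no overdraft" policy; all names are illustrative, and it is not the paper's LTS formalization or its counterexample trace.

```python
class Account:
    """Hypothetical shared state modified by several actors."""
    def __init__(self, balance):
        self.balance = balance

def policy_allows(state, amount):
    """Admissibility policy: a withdrawal must not overdraw the account."""
    return state.balance >= amount

def split_withdraw(state, amount, environment_step):
    """Split evaluation: decide, let the environment act, then commit."""
    decision = policy_allows(state, amount)  # evaluated in one state
    environment_step(state)                  # interleaved environmental action
    if decision:
        state.balance -= amount              # committed in a possibly different state
    return decision

def atomic_withdraw(state, amount):
    """Atomic boundary: decision and transition happen in one step.
    In real concurrent code this step would need a lock or a transaction."""
    if policy_allows(state, amount):
        state.balance -= amount
        return True
    return False

def concurrent_debit(state):
    """Environmental action: another actor debits 80."""
    state.balance -= 80

split_state = Account(100)
split_withdraw(split_state, 50, concurrent_debit)

atomic_state = Account(100)
concurrent_debit(atomic_state)      # the environment acts before the atomic step
atomic_withdraw(atomic_state, 50)   # refused: the policy sees the current state

print("split balance: ", split_state.balance)
print("atomic balance:", atomic_state.balance)
```

In the split variant the policy itself is correct, yet the committed transition violates it because the state changed between evaluation and commit. No change to `policy_allows` alone can prevent this.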

2604.17172 2026-04-23 cs.DC cs.AI

UCCL-Zip: Lossless Compression Supercharged GPU Communication

Shuang Ma, Chon Lam Lao, Zhiying Xu, Zhuang Wang, Ziming Mao, Delong Meng, Jia Zhen, Jun Wu, Ion Stoica, Yida Wang, Yang Zhou

Abstract

The rapid growth of large language models (LLMs) has made GPU communication a critical bottleneck. While prior work reduces communication volume via quantization or lossy compression, these approaches introduce numerical errors that can degrade convergence, accuracy, and stability. We present UCCL-Zip, a unified design that integrates lossless compression directly into GPU communication primitives. UCCL-Zip supports both point-to-point (P2P) and collective communication without modifying user-facing APIs or compromising numerical correctness. For P2P communication, Uzip-P2P employs a split-send pipeline that exposes transmissible data early and overlaps compression with communication, while preserving high GPU efficiency by operating on large data blocks. For collective communication, Uzip-NCCL integrates compression into NCCL's persistent kernel model via fused execution, eliminating redundant memory traffic and kernel launches. In real workloads, UCCL-Zip accelerates RL weight synchronization by up to 47.5% and reduces vLLM end-to-end inference latency by up to 10%, all without application changes.

2604.16779 2026-04-23 quant-ph cs.LG

Q-SINDy: Quantum-Kernel Sparse Identification of Nonlinear Dynamics with Provable Coefficient Debiasing

Samrendra Roy, Syed Bahauddin Alam

Abstract

Quantum feature maps offer expressive embeddings for classical learning tasks, and augmenting sparse identification of nonlinear dynamics (SINDy) with such features is a natural but unexplored direction. We introduce Q-SINDy, a quantum-kernel-augmented SINDy framework, and identify a specific failure mode that arises: coefficient cannibalization, in which quantum features absorb coefficient mass that rightfully belongs to the polynomial basis, corrupting equation recovery. We derive the exact cannibalization-bias formula $\Delta\xi_P = (P^\top P)^{-1} P^\top Q\,\hat{\xi}_Q$ and prove that orthogonalizing quantum features against the polynomial column space at fit time eliminates this bias exactly. The claim is verified numerically to machine precision ($<10^{-12}$) on multiple systems. Empirically, across six canonical dynamical systems (Duffing, Van der Pol, Lorenz, Lotka-Volterra, cubic oscillator, Rössler) and three quantum feature map architectures (ZZ-angle encoding, IQP, data re-uploading), orthogonalized Q-SINDy consistently matches vanilla SINDy's structural recovery while uncorrected augmentation degrades true-positive rates by up to 100%. A refined dynamics-aware diagnostic, $R^2_Q$ for $\dot X$, predicts cannibalization severity with statistical significance (Pearson $r=0.70$, $p=0.023$). An RBF classical-kernel control across 20 hyperparameter configurations fails more severely than any quantum variant, ruling out feature count as the cause. Orthogonalization remains robust under depolarizing hardware noise up to 2% per gate, and the framework extends without modification to Burgers' equation.
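
The bias formula and the orthogonalization fix can be checked numerically with ordinary least squares. This is a minimal sketch under several assumptions: plain least squares stands in for SINDy's sparse regression, the extra features `Q` are classical placeholders rather than quantum-kernel features, and the shift is taken as the polynomial-only coefficients minus the jointly fitted ones (the abstract does not state the sign convention).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)

# Polynomial library P (columns 1, x, x^2, x^3) and placeholder extra features Q.
P = np.column_stack([np.ones(n), x, x**2, x**3])
Q = np.column_stack([np.sin(3.0 * x), np.cos(2.0 * x)])  # not a quantum feature map

# Target generated by the polynomial basis only, plus small noise.
y = 1.5 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(n)

def lstsq(A, b):
    """Ordinary least-squares coefficients."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

k = P.shape[1]
xi_poly_only = lstsq(P, y)

# Joint fit on [P, Q]: the Q columns absorb part of the polynomial coefficients.
xi_joint = lstsq(np.hstack([P, Q]), y)
xi_P_joint, xi_Q_joint = xi_joint[:k], xi_joint[k:]

# Shift predicted by (P^T P)^{-1} P^T Q xi_Q.
predicted_shift = np.linalg.solve(P.T @ P, P.T @ Q @ xi_Q_joint)
gap_uncorrected = np.max(np.abs((xi_poly_only - xi_P_joint) - predicted_shift))

# Orthogonalize Q against the column space of P, then refit jointly.
Q_perp = Q - P @ np.linalg.solve(P.T @ P, P.T @ Q)
xi_orth = lstsq(np.hstack([P, Q_perp]), y)
gap_orthogonalized = np.max(np.abs(xi_orth[:k] - xi_poly_only))

print("formula residual:          ", gap_uncorrected)
print("shift after orthogonalizing:", gap_orthogonalized)
```

Both printed quantities should sit at floating-point noise level. After orthogonalization the normal equations decouple because `P.T @ Q_perp` vanishes, so the polynomial coefficients no longer depend on the extra features.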

2604.16756 2026-04-23 cs.SE cs.AI

Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering

Francesco Sovrano, Gabriele Dominici, Alberto Bacchelli

Comments Accepted for publication in the proceedings of FSE'2026

Abstract

Prompt-induced cognitive biases are changes in a general-purpose AI (GPAI) system's decisions caused solely by biased wording in the input (e.g., framing, anchors), not task logic. In software engineering (SE) decision support (where problem statements and requirements are natural language) small phrasing shifts (e.g., popularity hints or outcome reveals) can push GPAI models toward suboptimal decisions. We study this with PROBE-SWE, a dynamic benchmark for SE that pairs biased and unbiased versions of the same SE dilemmas, controls for logic and difficulty, and targets eight SE-relevant biases (anchoring, availability, bandwagon, confirmation, framing, hindsight, hyperbolic discounting, overconfidence). We ask whether prompt engineering mitigates bias sensitivity in practice, focusing on actionable techniques that practitioners can apply off-the-shelf in real environments. Testing common strategies (e.g., chain-of-thought, self-debiasing) on cost-effective GPAI systems, we find no statistically significant reductions in bias sensitivity on a per-bias basis. We then adopt a Prolog-style view of the reasoning process: solving SE dilemmas requires making explicit any background axioms and inference assumptions (i.e., SE best practices) that are usually implicit in the prompt. So, we hypothesize that bias-inducing features short-circuit assumption elicitation, pushing GPAI models toward biased shortcuts. Building on this, we introduce an end-to-end method that elicits best practices and injects axiomatic reasoning cues into the prompt before answering, reducing overall bias sensitivity by 51% on average (p < .001). Finally, we report a thematic analysis that surfaces linguistic patterns associated with heightened bias sensitivity, clarifying when GPAI use is less advisable for SE decision support and where to focus future countermeasures.

2604.15560 2026-04-23 astro-ph.EP astro-ph.IM cs.LG

ExoNet: Calibrated Multimodal Deep Learning for TESS Exoplanet Candidate Vetting using Phase-Folded Light Curves, Stellar Parameters, and Multi-Head Attention

Md. Rashadul Islam

Comments v2: Complete revision. Corrected systematic TOI/TIC cross-identification errors present in v1. Rebuilt inference pipeline using verified NASA Exoplanet Archive catalog (4,720 PC-disposition candidates, up from 200). Updated all results, figures, and performance metrics. 8 pages, 4 figures, 6 tables

Abstract

The discovery of exoplanets at scale has become one of the defining data science challenges in modern astrophysics. NASA's Transiting Exoplanet Survey Satellite (TESS) had catalogued over 7,800 planet candidates by early 2026, yet confirmation stands at fewer than 720. This paper introduces ExoNet, a multimodal deep learning framework that jointly processes phase-folded global and local light curve views alongside stellar parameter features through a calibrated late-fusion architecture combining 1D Convolutional Neural Networks, 8-head Multi-Head Attention over temporal feature maps, and a residual fusion head with post-hoc Temperature Scaling calibration. Trained on 7,585 labeled Kepler Objects of Interest, ExoNet achieves Test AUC = 0.9549 and 86.3% accuracy. Applied to 4,720 verified unconfirmed TESS Planet Candidates with TOI-TIC cross-identification verified against the NASA Exoplanet Archive, the model yields 1,754 high-confidence signals, 52 habitable-zone candidates, and six Earth-sized habitable-zone targets below 1.6 Earth radii. TOI-5728.01 and TOI-6716.01 emerge as the most Earth-like unconfirmed candidates. Full ablation confirms each modality improves AUC. Code and catalog are openly released.
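
Post-hoc temperature scaling, mentioned above, rescales a trained classifier's logits by a single scalar T fitted on held-out data. The following is a minimal sketch for a binary classifier, assuming hypothetical validation logits and a simple grid search in place of a gradient-based optimizer; it does not reproduce ExoNet's calibration code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the T on a grid (0.5 to 5.0) that minimizes validation NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Hypothetical validation set: confident logits, two of them wrong.
val_logits = [4.0, 3.5, -4.2, 3.8, -3.9, 4.1, -3.6, 3.7]
val_labels = [1, 1, 0, 0, 0, 1, 1, 1]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("NLL before:", nll(val_logits, val_labels, 1.0))
print("NLL after: ", nll(val_logits, val_labels, T))
```

Dividing by a positive T leaves the ranking of candidates unchanged, so ranking metrics such as AUC are unaffected; only the confidence attached to each score changes.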

2604.12456 2026-04-23 eess.AS cs.AI

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen

Abstract

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Our audio samples, code and checkpoints are released at https://github.com/Jerrister/X-VC.

2604.01965 2026-04-23 cs.IR cs.AI cs.CL cs.DL

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

Comments Accepted at NSLP@LREC 2026

Abstract

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

2603.14222 2026-04-23 cs.CR cs.AI

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Ruoxi Cheng, Yizhong Ding, Jian Zhao, Hongyi Zhang, Haoxuan Ma, Tianle Zhang, Yiyan Huang, Xuelong Li

Abstract

Contrastive pretraining models such as CLIP and CLAP serve as the ubiquitous perceptual backbones for modern multimodal large models, yet their reliance on web-scale data raises growing concerns about memorizing Personally Identifiable Information (PII). Auditing such models via membership inference is challenging in practice: shadow-model MIAs are computationally prohibitive for large multimodal backbones, and existing multimodal auditing methods typically require querying the target with paired biometric inputs, thereby directly exposing sensitive biometric information to the target model. To bypass this critical limitation, we demonstrate a highly desirable capability for privacy auditing: multimodal memorization within these foundational encoders can be accurately inferred using exclusively the text modality. We propose Unimodal Membership Inference Detector (UMID), a text-only auditing framework that performs text-guided cross-modal latent inversion and extracts two complementary signals, similarity (alignment to the queried text) and variability (consistency across randomized inversions). UMID compares these statistics to a lightweight non-member reference constructed from synthetic gibberish and makes decisions via an ensemble of unsupervised anomaly detectors. Comprehensive experiments across diverse CLIP and CLAP architectures demonstrate that UMID significantly improves the effectiveness and efficiency over prior MIAs, delivering strong detection performance with sub-second auditing cost using solely text queries, completely circumventing the need for biometric inputs and complying with strict privacy constraints.

2603.09046 2026-04-23 cs.CR cs.LG cs.OS

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia

Comments 13 pages, 11 figures

Abstract

Device-side Large Language Models (LLMs) have witnessed explosive growth, offering higher privacy and availability compared to cloud-side LLMs. During LLM inference, both model weights and user data are valuable, and attackers may even compromise the OS kernel to steal them. ARM TrustZone is the de facto hardware-based isolation technology on mobile devices, used to protect sensitive applications from a compromised OS. However, protecting LLM inference with TrustZone incurs significant overhead due to its inflexible isolation of memory and the NPU. To address these challenges, this paper introduces FlexServe, a fast and secure LLM serving system for mobile devices. It first introduces a Flexible Resource Isolation mechanism to construct Flexible Secure Memory (Flex-Mem) and Flexible Secure NPU (Flex-NPU). Both memory pages and the NPU can be efficiently switched between unprotected and protected modes. Based on these mechanisms, FlexServe designs a fast and secure LLM inference framework within TrustZone's secure world. The LLM-Aware Memory Management and Secure Inference Pipeline are introduced to accelerate inference. A Multi-Model Scheduler is proposed to optimize multi-model workflows. We implement a prototype of FlexServe and compare it with two TrustZone-based strawman designs. The results show that FlexServe achieves an average $10.05\times$ speedup in Time to First Token (TTFT) compared to the strawman, and an average $2.44\times$ TTFT speedup compared to an optimized strawman with pipeline and secure NPU enabled. For multi-model agent workflows, the end-to-end speedup is up to $24.30\times$ and $4.05\times$ compared to the strawman and optimized strawman, respectively.

2602.22437 2026-04-23 cs.DC cs.AI cs.LG

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu

Abstract

Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.

2602.20181 2026-04-23 cs.CY cs.AI

Catalyzing Informed Residential Energy Retrofit Decisions via Domain-Specific LLM

Lei Shu, Dong Zhao, Jianli Chen, Armin Yeganeh, Sinem Mollaoglu, Jiayu Zhou

Abstract

Residential energy retrofit initiation is often stalled by an expertise gap, where homeowners lack the technical literacy required for structured building energy assessments and are thereby trapped in low-information environments with fragmented sources. To bridge this gap, this study reports a domain-specific large language model (LLM) designed to catalyze informed decision-making based solely on homeowner-accessible, natural-language descriptions, e.g., building age, size, and location. The model is created using the parameter-efficient low-rank adaptation (LoRA) fine-tuning approach on a massive corpus grounded in physics-based energy simulations and techno-economic calculations from 536,416 U.S. residential building prototypes. Nine major retrofit categories are evaluated, including envelope upgrades, HVAC systems, and renewable energy installations. Validations against physics-grounded benchmarks show that the LLM consistently identifies high-quality retrofit options, achieving top-3 hit rates of 98.9% for maximum CO2 reduction and 93.3% for the shortest discounted payback year. Moreover, the model exhibits strong robustness under incomplete input conditions, maintaining stable performance even when basic dwelling descriptions are only 60% partially specified. By significantly lowering the information activation energy for non-expert users while maintaining scientific rigor, this physics-based AI model offers a scalable pathway for parallelized, user-centered decision making, accelerating cumulative energy savings and emission reductions across community and national scales.

2602.15037 2026-04-23 cs.SE cs.AI

CircuChain: Disentangling Competence and Compliance in LLM Circuit Analysis

Mayank Ravishankara

Abstract

As large language models (LLMs) advance toward expert-level performance in engineering domains, reliable reasoning under user-specified constraints becomes critical. In circuit analysis, for example, a numerically correct solution is insufficient if it violates established methodological conventions such as mesh directionality or polarity assignments, errors that can propagate in safety-critical systems. Yet it remains unclear whether frontier models truly apply first-principles reasoning or rely on entrenched training priors that conflict with explicit instructions. We introduce CircuChain, a diagnostic benchmark designed to disentangle instruction compliance from physical reasoning competence in electrical circuit analysis. CircuChain consists of counterbalanced Control/Trap problem pairs across five canonical circuit topologies, augmented with systematic variations in sign conventions, current orientations, and polarity definitions. A multi-stage verification pipeline, combining symbolic solvers, SPICE simulation, and an LLM-based error taxonomy, enables fine-grained attribution of failures to convention errors, physics errors, arithmetic mistakes, or hallucinations. Across 100 tasks per model, we observe a consistent Compliance-Competence Divergence. The strongest model evaluated exhibits near-perfect physical reasoning but a high rate of convention violations when Trap conditions deliberately invert natural sign patterns. Conversely, weaker models display lower physical fidelity yet superior adherence to explicit instructions. These results suggest that increased model capability does not guarantee improved constraint alignment and highlight the need for new evaluation frameworks that stress instruction-following under mathematically rigid domains. CircuChain provides one such framework and offers actionable insights for both engineering education and AI alignment research.