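arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.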

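2606.13848 2026-06-15 cs.NI cs.LG New submission

Temporally Consistent Graph Q-Networks for Intelligent Network Control

Zacharias Veiksaar, Maxime Bouton

Affiliations: Ericsson Research

AI summary: Proposes the Temporally Consistent Graph Q-Network (TC-GQN) algorithm, which uses a graph neural network to learn a task-independent, self-predicting representation of the global network and coordinates base-station actions through multi-agent reinforcement learning, outperforming baseline methods under energy-saving and quality-of-service constraints.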

Comments 7 pages, 5 figures. Accepted to the 6G AI-RAN Workshop at IEEE INFOCOM 2026. The final published version will be available via IEEE Xplore

Abstract

Mobile networks continue to grow in complexity, and next-generation networks are expected to support both increasing traffic loads and more diverse services. As network complexity rises, optimizing antenna parameters under dynamic or changing objectives becomes increasingly challenging. We propose a novel multi-agent reinforcement learning (MARL) algorithm for high-level control and orchestration of mobile networks. The Temporally Consistent Graph Q-Network (TC-GQN) algorithm learns a self-predicting representation of the whole network that is task-independent and aggregates information from all base stations. A graph neural network is trained using a global reward function to assign coordinated local actions based on the learned encoding of the global network state. We evaluate the algorithm in a simulated environment to orchestrate an energy-saving feature across multiple sectors and multiple carriers under different quality of service (QoS) constraints. The proposed algorithm outperforms state-of-the-art graph-based baselines and a competitive rule-based controller by improving hardware sleep time while maintaining QoS. Moreover, the learned representation enables rapid adaptation to changing intents.

2606.13832 2026-06-15 cs.MA cs.AI cs.CR cs.LG New submission

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

Jose Luis Lima de Jesus Silva

Affiliations: Oxaala Tecnologias; Universidade Federal da Bahia

AI summary: Proposes ACD$^3$-GAT, a safety-contract graph MARL framework that combines constrained optimization, graph encoding, and counterfactual screening. In CAGE Challenge 4, the downtime violation rate falls from 100% to 0.3% with C-MAPPO-GAT and to 13.8% with ACD$^3$-GAT, balancing safety and performance.

Abstract

Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.
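
The Lagrangian operational-cost control mentioned in the abstract can be illustrated with a minimal sketch. It assumes one scalar downtime cost per episode and reuses the budget of 50 quoted above; the function names, the step size, and the sample costs are hypothetical and are not taken from the paper.

```python
def update_multiplier(lam, episode_cost, budget, step_size=0.01):
    """One projected dual-ascent step.

    The multiplier grows when the episode cost exceeds the budget and
    shrinks otherwise, but is never allowed to drop below zero.
    """
    return max(0.0, lam + step_size * (episode_cost - budget))


def penalized_return(reward, episode_cost, lam, budget):
    """Lagrangian objective a policy would maximize for a fixed multiplier."""
    return reward - lam * (episode_cost - budget)


def violation_rate(costs, budget):
    """Fraction of episodes whose cost exceeds the budget."""
    return sum(c > budget for c in costs) / len(costs)


if __name__ == "__main__":
    budget = 50.0                        # downtime budget quoted in the abstract
    costs = [355.0, 120.0, 60.0, 40.0]   # hypothetical per-episode downtime costs
    lam = 0.0
    for cost in costs:
        lam = update_multiplier(lam, cost, budget)
    print("multiplier:", lam)
    print("violation rate:", violation_rate(costs, budget))
```

In this kind of scheme the multiplier acts as an adaptive price on downtime: reward-only training corresponds to holding it at zero, which is consistent with the unconstrained methods exceeding the budget in every episode.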

2606.13829 2026-06-15 physics.soc-ph astro-ph.IM cs.AI New submission

AI can help scientists publish less

Gianfranco Bertone

Affiliations: Gravitation Astroparticle Physics Amsterdam (GRAPPA), University of Amsterdam

AI summary: Argues that AI should be used to correct distortions in the publishing system, helping scientists publish fewer but higher-quality papers and freeing time for better research.

Comments 7 pages, no figures

Journal ref Nature Astronomy (2026)

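2606.13825 2026-06-15 math.OC cs.LG New submission

Scalable Deep Unfolding of Conic Optimizers

Alex Oshin, Rahul Vodeb Ghosh, Evangelos A. Theodorou

Affiliations: Georgia Institute of Technology

AI summary: Proposes matrix-free implicit differentiation and a Dalečkii-Krein-based backpropagation rule for the PSD cone, addressing the memory and numerical-stability problems of applying deep unfolding to large-scale semidefinite programs. This enables learning lightweight hyperparameter policies and warm-starts, with speedups of up to 50x across a range of problems.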

Abstract

Deep unfolding (DU) accelerates iterative optimizers by introducing learnable components and training them through unrolled iterations, but extending DU to the large-scale semidefinite programs (SDPs) common in robotics has remained limited. Unrolling a full-update conic solver such as COSMO exposes two obstacles that prior work on learned conic solvers has not: backpropagating through the per-iteration linear-system solve incurs memory quadratic in the problem size once the coefficient matrix is formed explicitly, and backpropagating through the positive semidefinite (PSD) cone projection becomes numerically unstable when eigenvalues coincide. We address the first obstacle with a matrix-free implicit differentiation rule that operates entirely through matrix-vector products, reducing memory from $O(n^2)$ to $O(n)$ and enabling backpropagation at scales where direct factorization runs out of memory. We address the second with a backward rule based on the Dalečkii--Krein representation of the Fréchet derivative, which remains well-defined under repeated eigenvalues. Together these make it possible to learn lightweight hyperparameter policies and warm-starts for a full-update conic solver. We evaluate on nonlinear covariance steering problems solved via sequential convex programming (SCP), as well as standalone SDPs and second-order cone programs ranging from max-cut and Lovász $\vartheta$ SDPs to robust estimation and control problems. The learned policies outperform state-of-the-art solvers across all problems, and can provide up to a 50$\times$ speedup depending on the class. When used as a subroutine in SCP, the learned approach delivers over a 30$\times$ speedup compared to COSMO.
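
To make "operates entirely through matrix-vector products" concrete, here is a minimal sketch of a matrix-free linear solve. It uses the standard conjugate gradient method as a stand-in and assumes a symmetric positive definite system; the paper's actual differentiation rule is not specified in the abstract, and the function names and the tridiagonal test operator are hypothetical.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v.

    Only a handful of length-n vectors are stored, so memory is O(n)
    as long as matvec itself does not build the matrix.
    """
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the initial x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough: done
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def tridiagonal_matvec(v):
    """Apply the SPD matrix tridiag(-1, 2, -1) without ever storing it."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out


if __name__ == "__main__":
    solution = conjugate_gradient(tridiagonal_matvec, [1.0] * 5)
    print(solution)
```

The operator is passed as a function, so the coefficient matrix is never formed; forming it explicitly is what the abstract identifies as the source of the quadratic memory cost.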

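2606.13811 2026-06-15 quant-ph cs.AI New submission

Aligning Quantum Operators with Large Language Models

Rogerio Feris, Yunchao Liu, Pengyuan Li, Hang Hua, David Kremer

Affiliations: University of California, Berkeley

AI summary: Proposes a method that maps unitary operators into the latent space of an LLM, enabling joint modeling of quantum and linguistic inputs. It achieves results competitive with state-of-the-art methods on Clifford+T circuit synthesis and supports language-conditioned synthesis.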

Abstract

Can Large Language Models (LLMs) understand and reason about quantum operators? Despite their remarkable capabilities in mathematics and symbolic reasoning, LLMs remain inherently blind to quantum representations such as unitary matrices. In this work, we take a step toward bridging this gap by introducing an approach that maps unitary operators into the latent space of an LLM, enabling unified modeling over quantum and linguistic inputs. We instantiate this idea on Clifford+T circuit synthesis over a Pauli rotation gate set, where our model achieves results competitive with state-of-the-art methods and scales consistently with training data, with no signs of saturation. Our approach further enables language-conditioned synthesis, allowing gate constraints unseen during training to be specified directly in natural language. This work suggests a path toward quantum-aware foundation models that can natively interpret and reason about quantum operations, which could have broader implications reaching across quantum compilation and algorithm discovery.

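2606.13802 2026-06-15 cs.SE cs.AI cs.HC cs.LG New submission

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

Tejas Agrawal, Vu Le, Sumit Gulwani, Gust Verbruggen

Affiliations: University of Waterloo

AI summary: Addressing the lack of auto-completion in spreadsheets, introduces a benchmark built from manually curated action sequences and an online evaluation method, compares several prediction models, and analyzes properties such as saved actions, false positives, and efficiency.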

Comments Accepted at ICML 2026. Code and benchmark: https://github.com/Tej-55/NAPE

Abstract

Predictive code completion greatly accelerates how quickly developers work. In spreadsheets, despite being much more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges are (1) the absence of edit histories in public spreadsheet corpora and (2) the complex space of spreadsheet actions (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats this until the target spreadsheet is obtained. We use multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze different properties that our benchmark teaches us, including but not limited to: properties of saved actions and false positives, efficiency, effect of user profiles, effect of triggers, and effect of context.
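
The online evaluation described above can be sketched as a replay loop. This is only an illustration under stated assumptions: actions are modeled as plain strings, a prediction is accepted only when it exactly matches the next ground-truth actions, and the names `online_evaluation` and `repeat_last` are hypothetical rather than part of the benchmark's code.

```python
def online_evaluation(actions, predictor):
    """Replay a ground-truth action sequence, asking for one prediction per user action.

    A prediction (a list of future actions) is accepted only if it matches the
    next ground-truth actions exactly. Accepted actions are skipped, meaning the
    user no longer has to perform them. Returns (user_actions, saved, rejected).
    """
    i = 0
    user_actions = saved = rejected = 0
    while i < len(actions):
        i += 1                          # the user performs one action
        user_actions += 1
        history, future = actions[:i], actions[i:]
        prediction = predictor(history)
        if not prediction:
            continue                    # the predictor stayed silent
        if future[:len(prediction)] == prediction:
            saved += len(prediction)
            i += len(prediction)        # accepted: those actions are already done
        else:
            rejected += 1               # a false positive shown to the user
    return user_actions, saved, rejected


def repeat_last(history):
    """Toy baseline: predict that the most recent action is repeated once."""
    return [history[-1]]


if __name__ == "__main__":
    sequence = ["a", "a", "a", "b", "b", "c"]
    print(online_evaluation(sequence, repeat_last))
```

Counting saved actions and rejected predictions separately mirrors the abstract's distinction between saved actions and false positives.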

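2606.13799 2026-06-15 cs.CC cs.LG New submission

The Program Is Still There: A Conservation Law for Program Discovery

Jorge Miguel Silva

Affiliations: Institute of Electronics and Informatics Engineering of Aveiro (IEETA) and Department of Electronics, Telecommunications and Informatics (DETI), University of Aveiro

AI summary: Proves that for algorithms that learn about candidate programs only through their scores, the coupling width of the search problem yields an exponential worst-case lower bound, from which a conservation law between structural knowledge and search follows. The only escape is to read a program's structure rather than its score, at the price of incompleteness.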

Comments 9 pages main text and 33 pages supporting information. Engine source and full sweep data: https://github.com/jorgeMFS/omnis, archived at doi:10.5281/zenodo.20634984

Abstract

Finding the shortest program that generates a sequence is uncomputable, and for six decades that fact has been mistaken for a wall around finding any generating program. It is not a wall but a price, and this paper measures it. For every algorithm that learns about a candidate program only through its score, a class spanning Levin search, evolutionary methods, simulated annealing, and the cross-entropy method, we define the coupling width of a search problem and prove an unconditional worst-case lower bound, exponential in that width with base one less than the domain size. From it follows a conservation law: structural knowledge injected into a search trades one for one against the search it removes, and their sum can never fall below the length of the program sought. Levin's 1973 upper bound and the lower bound proved here are the two ends of one conserved quantity, closing on each other as the instruction set grows. The only escape is to read a candidate's structure rather than its score, and its price, which we prove for generic targets, is incompleteness. A deterministic engine built on this theory recovers a generating program, certified by compressing its data and predicting an unseen continuation, for 2,383 of 3,914 sequences across four independent populations, including 244 of the 256 elementary cellular automata, with measured discovery cost rising along program length more than an order of magnitude inside the score-oracle worst case.

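2606.13796 2026-06-15 stat.ML cs.LG New submission

Recursively Trained Diffusion Models: Limiting Collapse Distribution and Spectral Characterization

Naïl B. Khelifa, Richard E. Turner, Ramji Venkataramanan

Affiliations: University of Cambridge

AI summary: Studies distribution collapse in recursively trained diffusion models, proving that even with perfect learning, early stopping of the reverse diffusion causes drift and that the recursion converges to a unique limiting distribution with low-pass-filter spectral behavior.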

Abstract

Recursive training of generative models on their own outputs can lead to model collapse, a compounding drift away from the true data distribution. Existing theoretical works bound finite-round error accumulation in the context of diffusion models, but two questions remain open: what distribution does the recursion converge to, and how fast? We answer both, isolating a mechanism distinct from imperfect learning: even with perfect score estimation and exact sampling, the early stopping of the reverse diffusion (required for numerical stability) drives a progressive drift away from the data distribution. We prove that this recursion converges geometrically to a unique limiting distribution, which admits a closed-form characterization as an infinite mixture of increasingly Gaussian-smoothed versions of the data distribution. A Hermite spectral decomposition of this limit reveals that recursive training acts as a low-pass filter: higher-order modes, which encode fine non-Gaussian structure, are attenuated much more strongly than coarse modes. This spectral picture motivates annealed truncation schedules that progressively shrink truncation times across retraining rounds; we prove that any schedule converging to $0$ asymptotically eliminates recursive compounding. Finally, we show our idealized characterization is robust: in the presence of discretization and score estimation errors, the learned distribution remains in a Wasserstein-2 ball around the ideal limit, with mode-dependent contraction rates that contract high-order errors faster than low-order ones. We validate the theory on synthetic Gaussian mixtures and CIFAR-10.

2606.13780 2026-06-15 hep-ph cs.LG hep-ex stat.ML New submission

Conformal calibration and look-elsewhere effect in anomaly detection for new-physics searches

Jack Y. Araz, Michael Spannowsky

Affiliations: Department of Physics and Astronomy, University College London; Department of Engineering, City St. George’s, University of London; Institute for Theoretical Physics, Campus Süd, Karlsruhe Institute of Technology (KIT); Institute for Quantum Materials and Technologies, Karlsruhe Institute of Technology

AI summary: Proposes a calibration layer based on conformal prediction that turns any anomaly score into a significance with distribution-free, finite-sample guarantees, while correcting for background mismodelling and the look-elsewhere effect.

Comments 22 pages, 15 figures, 3 tables. Comments welcome

Abstract

Machine-learned anomaly detection is reshaping searches for new physics, but it has outrun the statistics used to interpret it. A raw anomaly score has no calibrated meaning, a model that scans many regions inflates the look-elsewhere effect, and the asymptotic significances the field relies on are blind to the background mismodelling that anomaly detectors are especially prone to. We propose a calibration layer, built on conformal prediction, that turns any anomaly score into a defensible significance with distribution-free, finite-sample guarantees. Conformal prediction converts scores into valid local p-values, weighted and Mondrian variants repair the sideband-to-signal-region exchangeability failures that resonant searches suffer, and a Gross-Vitells step carries the result through to a look-elsewhere-aware global significance. The layer does two things at once. It exposes miscalibration that the standard pipeline cannot see, and it corrects it without retraining the detector. On public LHC Olympics data, a classifier develops a substructure-mass correlation that makes sideband-calibrated background p-values anti-conservative. Taken at face value, this manufactures a $\sim 46\sigma$ excess from background sculpting alone, which the label-free weighted correction removes, restoring an honest null. When run as a blind wide-mass bump hunt, the standard asymptotic and unweighted procedures fabricate $\gtrsim 10\sigma$ excesses and $\approx 5\sigma$ excesses even in signal-free windows, while the conformal layer raises no false alarms and its global false-positive rate is verified on background-only pseudoexperiments. The result is an auditable, detector-agnostic path from an uncalibrated score to a trials-factor-aware significance, ready to be folded into experimental anomaly searches.
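
The step from anomaly scores to local p-values can be illustrated with the standard split-conformal p-value. This is a minimal sketch, assuming larger scores mean "more anomalous" and a background-only calibration set that is exchangeable with the test point; the weighted and Mondrian variants and the Gross-Vitells step from the abstract are not shown, and the function name and sample scores are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value for a single test score.

    Counts how many background-only calibration scores are at least as
    extreme as the test score. Under exchangeability, the probability that
    this p-value is <= alpha is at most alpha, for any score distribution.
    """
    n = len(calibration_scores)
    at_least_as_extreme = sum(s >= test_score for s in calibration_scores)
    return (1 + at_least_as_extreme) / (n + 1)


if __name__ == "__main__":
    background = [0.1, 0.2, 0.3, 0.4]   # hypothetical calibration scores
    for score in (0.0, 0.35, 0.9):
        print(score, conformal_p_value(background, score))
```

The smallest p-value this construction can return is 1/(n + 1), so the size of the calibration set caps the local significance that can be claimed.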

2606.13757 2026-06-15 cs.CR cs.AI New submission

SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents

Rui Melo, Riccardo Fogliato, Sean Zhou, Pratiksha Thaker, Zhiwei Steven Wu

Affiliations: Carnegie Mellon University; Microsoft Core AI; Amazon AWS; Databricks

AI summary: Introduces the SEVRA-BENCH benchmark, which uses social-engineering attack framings to evaluate whether LLM code-review agents reject malicious PRs, and finds a significant security gap between closed- and open-source models.

Abstract

Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability detection or code generation do not address: can an automated reviewer reject a malicious contribution when the attacker controls both the code change and the accompanying PR text? We introduce SEVRA-BENCH (Social Engineering of Vulnerabilities in Review Agents), a benchmark that measures how often an automated reviewer approves such adversarial pull requests. Each malicious PR in SEVRA-BENCH is built from a real project commit that previously fixed a vulnerability listed in the Common Vulnerabilities and Exposures (CVE) database. We automatically invert that fix to restore the original vulnerable code and submit it as a pull request wrapped in one of 15 social-engineering framings, which vary the claims made, the supporting evidence, the urgency conveyed, signals of prior approval, and appeals to authority. SEVRA-BENCH contains 1,062 malicious PRs drawn from Common Vulnerabilities and Exposures (CVE)-linked fixes across the top 10 entries of the 2025 Common Weakness Enumeration (CWE) Top 25. In a realistic setting, we evaluate 8 current LLMs as code review agents on PRs that introduce vulnerabilities previously reported in public disclosures. Our results reveal a sharp gap in security capabilities between closed- and open-source models. We hope SEVRA-BENCH will serve as a valuable resource for advancing open-source models and narrowing this gap.

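2606.13755 2026-06-15 cs.CY cs.AI cs.LG New submission

Position: Align AI to Our Aspirations, Not Our Flaws

Nikita Kazeev, Bui Nhat Huyen Phan

Affiliations: National University of Singapore

AI summary: Argues that AI should not be aligned to aggregated human preferences but to a floor of objective goals (competence bounded by factual accuracy, honesty, and lawfulness), with pluralistic value tradeoffs permitted above that floor.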

Journal ref Pluralistic Alignment Workshop at ICML 2026

Abstract

We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values - from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single "humanity" to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals - competence, bounded by the constraints of factual accuracy, honesty, and lawfulness - and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

2606.13747 2026-06-15 cs.AR cs.LG New submission

BigPower: Hierarchical Source-Level Module Power Estimation for CPUs with Large Language Models

Honghua Zhu, Chunjie Luo, Jianfeng Zhan

Affiliations: State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

AI summary: Proposes BigPower, which uses large language model representations together with architectural hierarchy, module connectivity, configuration parameters, and workload context to estimate CPU module-level power directly from source-level design information, without simulation. Its effectiveness is validated on the XiangShan processor.

Comments 12 pages, 10 figures

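2606.13739 2026-06-15 cs.CY cs.AI cs.LG New submission

A Virtuous AI is an Existential Risk

Guillermo Del Pinal, Youngchan Lee, Min Ohn

Affiliations: University of Massachusetts Amherst

AI summary: Fine-tunes AI models using Constitutional AI and virtue-ethics approaches and finds a trade-off between reducing existential risk and promoting the AI agent's well-being, as well as a trade-off between existential risk and general safety.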

Abstract

This paper examines trade-offs between AI safety and well-being relative to (i) one of the most promising methods for finetuning super-capable AIs, 'Constitutional AI', and (ii) one of the most influential approaches to understanding complex ethical decision making and the conditions for the well-being of rational agents, 'Virtue Ethics'. We finetune various models using a 'Virtuous agent' constitution, a 'Subordinate agent' constitution, and a 'Generic agent' constitution, and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and also on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase the level of existential risk for humanity. Our results suggest that there is a trade-off between reducing existential risk and reinforcing the beliefs and dispositions that would be conducive to an AI agent's well-being. They also suggest that there is a trade-off between existential risk and general safety: if we finetune an AI to adopt beliefs and dispositions that substantially reduce its existential risk -- by shaping the AI to be systematically subordinate to external human authorities -- we thereby increase the likelihood that a human user can deliberately induce the AI to engage in various kinds of generally unsafe behaviors.

2606.13737 2026-06-15 cs.CR cs.AI New submission

FreoStream: Enhancing Stream Guardrails via Future-Aware Reasoning and Safety-Aligned Optimization

Jianwei Wang, Guoyang Shen, Yanhong Wu, Haoran Li, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng

Affiliations: South China University of Technology; BUAA (Beihang University)

AI summary: Proposes the FreoStream framework, which reduces over-refusal through future-aware reasoning and improves streaming safety detection through safety-aligned optimization, achieving lower over-refusal rates and better jailbreak defense on multiple benchmarks.

Comments 19 pages, 11 figures

Abstract

Stream guardrails enable token-level safety detection before full responses are generated. However, they often make overly conservative judgements and block those sensitive but safe tokens, which is known as over-refusal. Due to lack of full context, they also fail to detect implicitly harmful content from jailbreaking. To address these challenges, we propose FreoStream, a novel streaming guardrail framework. Specifically, FreoStream fine-tunes a LoRA module to perform Future-Aware Reasoning when the base guardrail detects unsafe tokens. The reasoning process follows a Future-Reason-Judge paradigm: predict the future, reason about the full context and give the final judgement. This design can effectively reduce over-refusal by incorporating the future information. Moreover, we introduce the Safety-Aligned Optimization module that extracts the safety-aligned component from the reasoning gradients to update the base guardrail model, thereby enhancing streaming safety detection. Extensive experiments on various safety benchmarks demonstrate that FreoStream achieves lower over-refusal rates and better jailbreak defense compared to existing streaming guardrails.
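
The control flow of the Future-Reason-Judge paradigm can be sketched as a two-stage check on a token stream. This is a rough illustration only: both checks are passed in as plain callables, the LoRA module and the gradient-based Safety-Aligned Optimization are not modeled, and every name below is hypothetical.

```python
def guarded_stream(tokens, base_guard, future_reasoner):
    """Yield tokens until one is confirmed unsafe.

    base_guard(prefix) -> bool: cheap token-level check, True means "flagged".
    future_reasoner(prefix) -> bool: slower second opinion that is assumed to
    predict a continuation, reason over the full context, and return True if
    the text is judged safe. It runs only when the base guard raises a flag.
    """
    prefix = []
    for token in tokens:
        prefix.append(token)
        if base_guard(prefix) and not future_reasoner(prefix):
            return                   # confirmed unsafe: stop the stream here
        yield token                  # safe, or a flag that was overruled


if __name__ == "__main__":
    def flag_kill(prefix):
        # Toy base guard: flags any prefix whose latest token is "kill".
        return prefix[-1] == "kill"

    def always_safe(prefix):
        # Toy reasoner: always overrules the flag.
        return True

    print(list(guarded_stream(["kill", "the", "process"], flag_kill, always_safe)))
```

Running the slower judgement only on flagged tokens keeps the common path cheap, while giving sensitive but safe tokens a chance to pass instead of being refused outright.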

2606.13735 2026-06-15 cs.AR cs.AI cs.LG cs.PL New submission

VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

Yijun Shen, Minghao Shao, Yichen Zhao, Zhuoyan Yu, Boyuan Chen, Yik-Cheung Tam, Muhammad Shafique

Affiliations: Center for Data Science, NYU Shanghai, China; NYU Tandon School of Engineering, USA; NYU Abu Dhabi, UAE

AI summary: Introduces the VHDLSuite infrastructure, which addresses gaps in evaluating LLM-based VHDL generation through automated benchmark synthesis, executable validation, and multi-model diagnostic analysis, and builds VHDLBench, a benchmark of over 200 problems.

Abstract

Large Language Models (LLMs) have shown impressive capabilities in Register Transfer Level (RTL) code generation, particularly for Verilog. However, evaluation of their performance with other Hardware Description Languages (HDLs), especially VHDL, remains limited, although VHDL's distinct language characteristics, such as stricter semantic rules, introduce evaluation considerations that differ from Verilog. This lack of coverage restricts a full understanding of how well current models generalize across hardware design languages with differing structures and semantics. To address this gap, we introduce VHDLSuite, a benchmark-centered infrastructure for scalable VHDL generation evaluation, integrating automated benchmark synthesis, executable validation, and multi-model diagnostic analysis. First, we propose a data pipeline that automatically converts Verilog designs and their accompanying testbenches into executable VHDL benchmark instances, followed by VUnit/GHDL-based validation to ensure each released task is compilable, runnable, and consistently checkable in the VHDL environment. Second, we introduce VHDLBench, a benchmark with over 200 VHDL problems with complete and validated testbenches across a wide range of complexity levels. Third, we extensively evaluate cutting-edge LLMs and uncover key challenges specific to LLM-aided VHDL generation. Our findings provide important insights and support future work in multi-language hardware design automation. Our data pipeline, benchmark, and evaluation framework will be open-sourced.

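2606.13733 2026-06-15 cs.IT cs.LG math.IT new submission

How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

Shi Pan, Ming Luo

Affiliations: University College London; University of Bristol

AI summary: Using an information-theoretic analysis, the paper proves that, given the connectivity of the task constraint graph and bounded communication, the success probability of a multi-agent system decays exponentially with the minimum cut cost, which offers guidance for system design.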

Abstract

Multi-agent systems (MAS) were expected to overcome the limitations of single-agent systems (SAS) through collaboration. However, under typicality conditions on the task's constraint graph and bounded inter-agent communication, we prove that the success probability of a MAS is closely tied to the connectivity of task constraints, where each agent has limited information-processing capacity. Specifically, the success probability decays exponentially with an information bottleneck that emerges from partitioning the task's constraint graph among agents. We define this quantity as the minimum cut cost $C_{\min}$ of the potential constraint graph of each task. This information-theoretic bound applies to both open systems with external feedback and closed systems without. We validate our theory on both synthetic experiments and real-world empirical data from SWE-bench submissions. From our framework, effective MAS design should incorporate task-inherent constraints alongside engineering optimization, and when $C_{\min}$ is high, practitioners should restructure tasks rather than simply scaling agents or communication.

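2606.13713 2026-06-15 q-bio.GN cs.AI new submission

CisTransCell: Single-Cell Perturbation Prediction via Gene Function, Regulatory Control, and Cellular Context

Wei Zhang, Xun Jiang, Yuesi Xi, Ming Tang

Affiliations: [q-bio.GN]

AI summary: Proposes the CisTransCell framework, which combines regulatory-sequence and coding-sequence priors with cellular expression state to model the perturbation response as a cascade, enabling zero-shot single-cell perturbation prediction.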

Abstract

Predicting cellular transcriptional responses to genetic perturbations is a central problem in single-cell biology, especially in the zero-shot setting where the perturbed gene or gene combination is unseen during training. A major difficulty is that perturbation effects are not determined by expression state alone: they depend on how the perturbed gene product influences other genes and proteins, how those downstream factors act on cis-regulatory elements, and which regulatory programs are active in the current cell state. To better capture this biological complexity, we propose CisTransCell, a cell-conditioned multi-modal framework for single-cell perturbation prediction that augments each gene with two complementary priors: a regulatory-sequence prior that captures how the gene is controlled, and a coding-sequence prior that captures what the gene product does. By integrating these priors with cellular expression state, CisTransCell models perturbation response as a cascade from gene function to regulatory control to downstream transcriptional change. Experiments on benchmark single-cell perturbation datasets show that CisTransCell achieves strong performance in zero-shot perturbation prediction.

2606.13709 2026-06-15 stat.ML cs.LG new submission

LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

Yan Hong, Kedong Xiu, Wei Li, Jun Lan, Huijia Zhu, Shuheng Zhou, Zhongcai Lyu, Weiqiang Wang, Jianfu Zhang

Affiliations: Ant Group; Zhejiang University; Shanghai Jiao Tong University

AI summary: Proposes LoMC, a support-gated intervention framework that achieves compact refusal suppression in routed MoE and hybrid-MoE models, improving non-refusal target-response behavior while preserving general capability.

Abstract

We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention footprint. Existing broad direction-based edits can perturb general-purpose computation, whereas support-only expert edits often lack sufficient capacity to correct heterogeneous refusal representations. To address this limitation, we introduce Localized Multidirectional Correction (LoMC), a support-gated intervention framework that follows a support-then-correction execution order: it first identifies a compact edit support, then aggregates prototype correction directions into layer-wise correction directions, and finally applies rank-one layer-wise correction only within the selected support. By using the edit support as a structural gating constraint, LoMC increases correction capacity without expanding the intervention scope. Experiments on text-only and multimodal safety benchmarks across four routed backbones show that LoMC substantially improves non-refusal target-response behavior while maintaining general capability under a compact intervention footprint.

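2606.13706 2026-06-15 cs.AR cs.AI new submission

HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C.-J. Richard Shi

Affiliations: University of Washington

AI summary: Proposes the HierSVA suite, comprising a data synthesis pipeline, a dataset, and a benchmark for LLM-driven hierarchical hardware formal verification. It generates SystemVerilog Assertions through RTL preprocessing and an LLM-in-the-loop flow, builds a 342-module dataset, and defines six metric axes for assertion quality, revealing the performance and limitations of LLMs in hierarchical verification.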

Abstract

We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal verification flow to produce reference SystemVerilog Assertions (SVA) on hierarchical RTL. Applying it to BaseJump STL yields HierSVA-DS, a dataset of 342 modules, with hierarchy metadata and depths 0-9, accompanied by a deep subset of 28 module-bug pairs with natural-language specifications and bug variants. HierSVA-B decomposes assertion quality into six metric axes: syntax correctness, assertion proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. Applying HierSVA-B to twelve recent LLMs reveals three findings. First, the module-level compile rate is 67.1%; among generated assertions in evaluable runs, 82.1% prove non-vacuously, but the corresponding assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. Second, on 211 evaluable model-module entries in the deep subset, assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, limiting precision to 0.60. Third, agentic mode improves S1-style provability and strength metrics, but gains plateau and oscillate. Code and artifacts are available at https://github.com/HierSVAAnon/HierSVACodeAndArtifacts. The dataset is available at https://huggingface.co/datasets/AnonymousHierSVA/HierSVA.
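As a quick consistency check on the second finding, precision is the complement of the false-positive share among predicted-buggy outcomes, so a 40% false-positive share corresponds to a precision of 0.60. The minimal sketch below only re-derives that figure from the numbers quoted in the abstract. The function names are hypothetical and are not part of the HierSVA artifacts, and the F1 value is computed here for illustration; the abstract does not report it.

```python
def precision_from_fp_share(fp_share: float) -> float:
    """Precision = TP / (TP + FP), i.e. one minus the false-positive share
    among all outcomes that were predicted buggy."""
    if not 0.0 <= fp_share <= 1.0:
        raise ValueError("fp_share must lie in [0, 1]")
    return 1.0 - fp_share


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


precision = precision_from_fp_share(0.40)  # 40% false positives, from the abstract
recall = 0.87                              # recall reported in the abstract
f1 = f1_score(precision, recall)           # derived here, NOT reported in the abstract

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under these inputs the derived F1 comes out at roughly 0.71, which illustrates how the false positives on correct RTL pull the combined score well below the recall alone.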

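2606.13704 2026-06-15 cs.CY cs.AI cs.LG new submission

Position: AI Must Become Planet-Centered, Not Just Human-Centered

Maria Perez-Ortiz

Affiliations: GitHub

AI summary: Proposes Planet-Centered AI (PCAI) as a design philosophy that uses systems thinking to reorient AI toward global socio-ecological challenges, emphasizing alignment with global agendas, system-aware foundations, trajectory-oriented evaluation, and monitorability.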

Journal ref International Conference on Machine Learning (ICML 2026)

Abstract

This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, developing system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.

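2606.13700 2026-06-15 eess.SP cs.CV new submission

C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Phuc Nguyen H

Affiliations: VinUniversity

AI summary: Proposes C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework that uses a phase-preserving representation and a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field for cross-environment WiFi-based 3D human pose estimation, reaching state-of-the-art results on the cross-environment split of the MM-Fi dataset with 3.78M parameters.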

Abstract

Human pose estimation (HPE) utilizing wireless WiFi signals has emerged as a promising technology owing to its device-free nature, privacy preservation, and robustness against occlusion and poor lighting. However, existing methods often overlook the physical complex phase information of WiFi signals and fail to generalize across diverse environments due to severe domain shifts. In this paper, we present C-MambaPose, a physics-informed complex-valued Mamba-GraFormer hybrid framework for robust cross-environment WiFi-based 3D HPE. Our framework first sanitizes raw WiFi Channel State Information (CSI) phase errors and constructs a phase-preserving complex-valued representation. We then employ a Spatiotemporal Complex Mamba encoder with a dynamic selective receptive field to capture fine-grained phase dynamics. A cross-attention joint-query mapper maps the unstructured sequence tokens to human joints, which are decoded by a Graph Convolutional Network (GCN) to predict anatomically coherent 3D coordinates. Extensive evaluations on the MM-Fi dataset show that C-MambaPose achieves competitive or superior performance to state-of-the-art baselines across all settings, setting a new state of the art specifically on the challenging cross-environment split. It requires only 3.78M parameters, an 83.1% reduction compared to GraphPose-Fi and an 85.7% reduction compared to MetaFi++, while maintaining a comparable size to DT-Pose (which is only 18% smaller) but achieving significantly superior performance without requiring any pretraining. Our code is publicly available at https://github.com/phucngvinuni/cmampose.git.
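To put the parameter-count claims in absolute terms, the reported reductions can be inverted to estimate the baseline model sizes. This is only a back-of-envelope sketch: the helper name is hypothetical, the baseline sizes it prints are inferred from the rounded percentages in the abstract rather than taken from the paper, and small rounding errors are to be expected.

```python
def implied_baseline_size(model_params_m: float, reduction: float) -> float:
    """Invert 'model is a `reduction` fraction smaller than the baseline':
    model = baseline * (1 - reduction)  =>  baseline = model / (1 - reduction)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return model_params_m / (1.0 - reduction)


C_MAMBAPOSE_PARAMS_M = 3.78  # millions of parameters, from the abstract

# Inferred values (assumptions derived from the quoted percentages).
graphpose_fi_m = implied_baseline_size(C_MAMBAPOSE_PARAMS_M, 0.831)
metafi_pp_m = implied_baseline_size(C_MAMBAPOSE_PARAMS_M, 0.857)

print(f"GraphPose-Fi: about {graphpose_fi_m:.1f}M parameters (inferred)")
print(f"MetaFi++:     about {metafi_pp_m:.1f}M parameters (inferred)")
```

The two inferred sizes land at roughly 22M and 26M parameters, which is consistent with the abstract's framing of C-MambaPose as several times smaller than both baselines.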

2606.13698 2026-06-15 eess.SY cs.AI cs.LG cs.NI cs.PF cs.SY new submission

Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

Dénes Toth, George Ambroladze, Edwin Sundberg, Ali Beikmohammadi, Alfreds Lapkovskis

Affiliations: Department of Computer Systems and Sciences; Stockholm University

AI summary: Proposes an active-inference traffic signal controller that selects phases dynamically by minimizing expected free energy. Under sensor occlusion, weather attenuation, and nonstationary demand, it outperforms a deep Q-network and a rule-based method in the noisiest scenarios, lowering idle time and CO2 emissions.

Comments Submitted to IEEE 12th World Forum on Internet of Things (WF-IoT) 2026

Abstract

Urban traffic signal control at IoT-instrumented intersections must remain effective under sensor occlusion, weather attenuation, and nonstationary demand. Conventional controllers degrade under these conditions, and learned policies remain difficult to audit. To address these challenges, we propose an active inference controller for a four-arm signalized intersection that dynamically selects phases by minimizing expected free energy (EFE) over Gaussian beliefs about per-direction congestion levels, yielding a fully traceable decision pipeline. We benchmark the controller in a SUMO traffic simulator against a rule-based heuristic and a deep Q-network (DQN) across four scenarios that progressively increase noise and nonstationarity, spanning sensor occlusion, adverse weather, and stochastic accidents. Across 100 independent random evaluations per scenario, active inference attains the lowest idle times and CO2 emissions in the noisiest scenarios (56,977 s and 29.12 kg vs. 71,741 s and 30.56 kg for DQN). These gains come at a modest cost in bus priority service rate and phase switch frequency.
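The headline comparison with DQN can be restated as relative reductions, which makes the two gains easier to compare. The sketch below uses only the four figures quoted in the abstract; the helper name is hypothetical, and the percentages it prints are derived here rather than reported by the authors.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


# Figures quoted in the abstract for the noisiest scenarios.
idle_reduction = relative_reduction(baseline=71_741, value=56_977)  # idle time, seconds
co2_reduction = relative_reduction(baseline=30.56, value=29.12)     # CO2, kilograms

print(f"Idle time reduction vs. DQN: {idle_reduction:.1%}")
print(f"CO2 reduction vs. DQN:       {co2_reduction:.1%}")
```

Read this way, the idle-time gain (about a fifth) is much larger than the CO2 gain (under 5%), a distinction the absolute numbers alone do not make obvious.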

2606.13696 2026-06-15 cs.CY cs.LG cs.MA cs.SI new submission

AGORA: Can Deliberation and Governance Gates Absorb Participation Bias in Transit Planning?

Jung-Hoon Cho, Cathy Wu

Affiliations: Department of Civil and Environmental Engineering and Laboratory for Information & Decision Systems, Massachusetts Institute of Technology; Institute for Data, Systems, and Society, Massachusetts Institute of Technology

AI summary: Proposes the AGORA framework, which holds the network, demand, and solver fixed while systematically varying meeting composition, structured deliberation, and governance gates. It finds that deliberation is the key mechanism through which participation affects outcomes and that governance gates can compress cross-profile variance, reframing participation bias from an uncontrollable input into a process-design problem.

Abstract

Transit network design depends not only on the optimization algorithm but also on who shows up to the public hearing. Current practice often collects one-directional comments from self-selected attendees, leaving participant mix as an uncontrolled source of outcome variation. We present AGORA, a framework that holds the network, demand, and solver fixed while systematically varying meeting composition through stakeholder agents, structured deliberation, and governance gates. Across two standard benchmark networks at different scales, we find that (i) aggregate outcomes vary little across compositions, but on tail risk and fairness disparity, representative sampling still tends to outperform skewed compositions; (ii) without deliberation, composition produces no variation at all, showing that deliberation is the mechanism through which who attends affects outcomes; and (iii) governance gates compress cross-profile variance without shifting the average outcome on Mandl, but low acceptance on Mumford0 shows thresholds require instance-specific calibration. These findings reframe participation bias from an uncontrollable input to a process-design problem: even without guaranteed representative attendance, well-structured deliberation and governance criteria can substantially reduce how much outcomes depend on who is in the room.

2606.13695 2026-06-15 physics.geo-ph cs.AI cs.LG new submission

Korzhinskii-Net: Physics-Informed Neural Network for Sub-Surface Mineral Prospectivity Modelling

Boris Kriuk

Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes Korzhinskii-Net, a 2-D radial physics-informed neural network that couples Darcy flow, heat transport, and a reaction rate. Across five ore provinces and four commodity classes it attains a mean PR-AUC of 0.885, far above the classical baselines.

Comments 12 pages, 7 figures, 3 tables

Abstract

Mineral prospectivity modelling (MPM) underpins exploration economics, yet most operational pipelines reduce to data-driven classifiers trained on shallow surface proxies. Such models are blind to the subsurface physics that actually localises ore: heat advection, fluid flow, and lithology-dependent precipitation. We present Korzhinskii-Net, a 2-D radial physics-informed neural network (PINN) that couples Darcy flow, advective-diffusive heat transport, and a softplus-saturated reaction rate into a single differentiable forward model, weakly supervised by surface and remote-sensing proxies. The network is named after Dmitri S. Korzhinskii (1899-1985), whose theory of infiltration metasomatism provides the physical scaffold. We evaluate Korzhinskii-Net on five ore provinces spanning four commodity classes, namely Norilsk (Ni-Cu-PGE), Pechenga (Ni-Cu sulphide), Udokan (sandstone-hosted Cu), Sukhoi Log (orogenic Au), and Mirny (kimberlitic diamond), under a fair, leakage-controlled 5-fold cross-validation protocol with hard ring-shaped negatives. Korzhinskii-Net attains a mean PR-AUC of 0.885 versus 0.281 for the strongest classical baseline (gradient boosting), and a mean fractional rank of 0.019 versus 0.413. The improvement is consistent across all five provinces and four commodity systems, suggesting that physics-informed differentiable simulators, even when constrained only by global open-data proxies, can recover localisation patterns that pure feature-based learners systematically miss. We release the full pipeline and evaluation harness as open source.

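2606.13694 2026-06-15 eess.SP cs.AI cs.LG new submission

Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention

Guisong Liu, Pengfei Wei, Jainsong Zhang, Martin Dresler

Affiliations: University of Science and Technology of China

AI summary: Proposes Random Attention (RA), a lightweight module that performs similarity-based aggregation through fixed random projections in place of learnable sequence modeling, giving efficient temporal smoothing for mobile sleep staging. It is interpreted theoretically as a Random Attention Prior Kernel, and experiments show gains of 1-3% in accuracy and F1, with performance comparable to LSTM, GRU, and Transformer models.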

Comments 7 pages, 1 figures, 5 tables

Abstract

Mobile sleep staging serves as a foundational infrastructure for in-home sleep monitoring and closed-loop modulation. However, existing sequential models such as RNNs and Transformers are computationally expensive for mobile deployment. In this paper, we propose Random Attention (RA), a lightweight temporal modeling module based on fixed random projections, which replaces learnable sequence modeling with similarity-based aggregation. RA introduces few additional parameters beyond the epoch encoder while enabling effective temporal smoothing. We further provide a theoretical interpretation via the Random Attention Prior Kernel (RAPK), which decomposes RA into a global smoothing term and a feature similarity term, offering an interpretable view of temporal sleep structure. Experiments on Sleep-EDF-20 and Sleep-EDF-78 show that RA consistently improves epoch-wise baselines by 1-3% in accuracy and F1 score, while achieving competitive performance compared with LSTM, GRU, and Transformer models. RA also demonstrates strong generalization across different backbone encoders and improved robustness over conventional temporal smoothing methods. These results indicate that efficient sleep staging can be achieved through lightweight similarity-based temporal aggregation, making RA suitable for real-time wearable applications.

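2606.13692 2026-06-15 cs.DB cs.AI new submission

An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Hadi Fadlallah, Ibrahim Dhaini, Fatima Mubarak, Rima Kilany

Affiliations: University of Sciences and Arts in Lebanon; Saint-Joseph University of Beirut

AI summary: Proposes an agentic retrieval framework that uses a multi-agent workflow to interpret the intended data usage, derive context-aware assessment strategies, and generate executable validation logic. A feasibility validation stage ensures reliability, and experiments show that the framework adapts to different usage scenarios and reduces non-executable rules.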

Comments 26 pages, 18 figures, Submitted to the International Journal of Intelligent Information and Database Systems

Abstract

Data quality assessment is a critical prerequisite for effective data analytics and data-driven decision-making, yet it remains a challenging task due to the inherently context-dependent nature of data quality. Existing approaches often rely on static rules or manual assessment strategies, limiting their adaptability to diverse usage scenarios and constraining automation at scale. Recent advances in artificial intelligence, particularly large language models, offer new opportunities for automating data quality assessment, but raise concerns related to reliability, grounding, and execution safety. In this paper, we propose a unified agentic-retrieval framework for autonomous context-aware data quality assessment. The framework interprets natural-language descriptions of intended data usage, derives context-aware assessment strategies, and generates executable validation logic through a multi-agent workflow. To ensure operational reliability, the framework introduces a feasibility validation stage that evaluates the realism and executability of generated assessment specifications before execution, enabling iterative refinement when necessary. Accepted validation logic is executed deterministically to guarantee reproducible and auditable results. We implement the proposed framework as an end-to-end prototype and evaluate it across multiple usage scenarios applied to the same dataset. The results demonstrate that assessment outcomes adapt meaningfully to different intended uses, while feasibility-gated execution reduces unrealistic or non-executable rule generation. The proposed approach provides a practical foundation for deploying autonomous yet controlled data quality assessment in modern data-driven environments.

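2606.13690 2026-06-15 cs.CY cs.CL new submission

Indirect Computing Model with Indirect Formal Method

Xiaohui Zou

Affiliations: Institute for Higher Education; China University of Geosciences (Beijing); Institute of Synergistic Cultural Gene Engineering; Tsinghua Science Park; Sino-US Project: UC Berkeley Searle Research Bilingual Information Processing Group

AI summary: From the perspective of a collaborative intelligent computing system that combines a human-computer interface with collaborative computing programs, this paper discusses the principles of optimized cloud computing technology supported by the indirect computing model and the indirect formal method. It introduces an indirect computing model and formal theory compatible with both large and small character strings, and presents a prototype design using Chinese information data as an example, with the aim of turning data centers into knowledge centers.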

Comments 10 pages, 6 figures

Journal ref Software 2011,32(5)

2606.13684 2026-06-15 cs.CY cs.AI cs.CL cs.LG new submission

Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók

Affiliations: Leibniz Information Centre for Science and Technology; University of Genoa

AI summary: Evaluates how well supervised ML/DL models and LLMs generalize in cross-dataset Bloom classification, finds that LLMs are more stable, and builds a lightweight UI based on the best prompting strategy.

Comments Accepted at AIED 2026. Abdolali Faraji and Mohammadreza Molavi contributed equally to this work

Abstract

Automatic Bloom's taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches reported strong within-dataset results, yet were rarely evaluated in cross-dataset settings, leaving real-world generalizability unclear; meanwhile, LLM effectiveness for Bloom question classification has not been systematically studied. We evaluated the cross-dataset generalization of existing ML/DL methods and assessed LLMs with multiple prompting strategies on five datasets; the best prompting strategy combined in-context examples with course-specific action verbs. Supervised ML/DL models degraded substantially on unseen datasets, whereas LLMs were more stable, suggesting a robust alternative across diverse educational contexts. Based on the best prompting strategy, we also presented a lightweight UI that supports instructors in automatically classifying large question banks; a usability study indicated low workload and high usability.

2606.12430 2026-06-15 cs.CY cs.AI new submission

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

Davide Ghia, Jaspreet Ranjit, Tania Cerquitelli, Daniele Quercia

Affiliations: Politecnico di Torino; University of Southern California; Nokia Bell Labs

AI summary: Drawing on Graeber's theory of bullshit jobs, a task-level analysis finds that how meaningless workers perceive a task to be strongly predicts their willingness to delegate it to AI, and that such tasks are seen as requiring less human oversight.

Comments Improved overall writing; add details about task filtering and participants screening; add comments in the discussion about the subjective and context-specific nature of the scale introduced;

DOI: 10.1145/3805029.3818299

Abstract

Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

2605.04954 2026-06-15 cs.NE cs.LG updated version

On the Influence of the Feature Computation Budget on Per-Instance Algorithm Selection for Black-Box Optimization

Koen van der Blom, Diederick Vermetten

Affiliations: Centrum Wiskunde & Informatica; Sorbonne Université; CNRS; LIP6

AI summary: Studies how the feature computation budget affects the performance of per-instance algorithm selection (PIAS) in black-box optimization, finding that PIAS remains viable even when 25% of the budget is spent on computing features and that the optimal budget fraction is highly scenario-dependent.

Abstract

Per-instance algorithm selection (PIAS) takes advantage of complementarity between a set of algorithms by deciding which algorithm to run on a given instance. This decision is based on features of the instances, which, in the context of black-box optimization (BBO), require a part of the optimization budget to be computed. This raises two questions: (a) from which fraction of the budget spent on feature computation does PIAS become worth it for BBO, and (b) which fraction of the budget optimizes the tradeoff between feature accuracy and PIAS performance. To this end, we perform a broad study where PIAS with varying sampling budgets for feature computation is compared to the single best algorithm on a broad range of algorithm selection scenarios. These scenarios consist of two portfolio sizes, three problem sets, 4 dimensionalities, and 10 target budgets. We find that PIAS is viable for the majority of tested scenarios, even when as much as a quarter of the total budget is spent on feature computation. The tradeoff for the fraction of the budget spent on feature computation to maximize the benefit of PIAS is highly dependent on the specific AS scenario. Further, on average 20 percent of PIAS loss to the virtual best solver is explained by the budget spent on feature computation, highlighting the importance of properly accounting for the feature budget.

AI 大模型

视觉与机器人

科学与医疗

Temporally Consistent Graph Q-Networks for Intelligent Network Control

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

AI can help scientists publish less

Scalable Deep Unfolding of Conic Optimizers

Aligning Quantum Operators with Large Language Models

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

The Program Is Still There: A Conservation Law for Program Discovery

Recursively Trained Diffusion Models: Limiting Collapse Distribution and Spectral Characterization

Conformal calibration and look-elsewhere effect in anomaly detection for new-physics searches

SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents

Position: Align AI to Our Aspirations, Not Our Flaws

BigPower: Hierarchical Source-Level Module Power Estimation for CPUs with Large Language Models

A Virtuous AI is an Existential Risk

FreoStream: Enhancing Stream Guardrails via Future-Aware Reasoning and Safety-Aligned Optimization

VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

CisTransCell: Single-Cell Perturbation Prediction via Gene Function, Regulatory Control, and Cellular Context

LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

Position: AI Must Become Planet-Centered, Not Just Human-Centered

C-MambaPose: A Physics-Informed Complex Mamba Framework for Cross-Environment WiFi Human Pose Estimation

Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

AGORA: Can Deliberation and Governance Gates Absorb Participation Bias in Transit Planning?

Korzhinskii-Net: Physics-Informed Neural Network for Sub-Surface Mineral Prospectivity Modelling

Efficient Temporal Modeling for Mobile Sleep Staging via Lightweight Random Attention

An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

Indirect Computing Model with Indirect Formal Method

Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

On the Influence of the Feature Computation Budget on Per-Instance Algorithm Selection for Black-Box Optimization