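arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.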

2606.11508 2026-06-11 cs.LG q-bio.QM New submission

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

Institution: NVIDIA

AI summary: Proposes a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information. A single probabilistic latent-variable objective covers reconstruction, contrastive discrimination, and chemistry tasks, and multi-task fine-tuning uses task-specific MLP heads. Across three datasets, the reported gains over the KERMT baseline range from 7.6% to 9.9%, averaged over significantly improved endpoints.

Abstract

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2606.11507 2026-06-11 cs.CV New submission

SceneMiner: Identity-Preserving Multi-Task Fine-Tuning for Unified BEV Scene Mining

Abdalmalek Aburaddaha, Venkatraman Narayanan, Keval Thaker, Samir A. Rawashdeh

Institution: University of Michigan-Dearborn

AI summary: Proposes SceneMiner, a unified camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass. It identifies cross-task interference as a failure mode and removes it with identity-preserving multi-task fine-tuning, which zero-initializes each new sub-module and freezes every parameter that feeds the shared stream.

Abstract

Mining hard, safety-critical scenes from driving logs is bottlenecked by the absence of difficulty labels, and no single proxy (collision risk, trajectory ambiguity, or semantic rarity) suffices to find such scenes on its own. We present SceneMiner, a unified, camera-only bird's-eye-view pipeline that emits complementary mining signals from a frozen vision-language backbone in a single forward pass, with no LiDAR or radar: a retrieval embedding for text-prompted scenario search, a multi-label scene-tag distribution, and a continuous physics-based risk score (a motion forecast is a byproduct, not a contribution). Building such a multi-head model exposes our central finding, a failure mode we term cross-task interference: adding or upgrading one head shifts a shared activation stream and degrades weight-frozen sibling heads, so freezing parameters alone is insufficient. Our contribution, identity-preserving multi-task fine-tuning, removes this interference by zero-initializing every new sub-module and freezing every parameter that feeds the shared stream. The mining heads are thereby preserved bit-identically while training only ~102k parameters. The tagging head reaches mAP 0.4614 (micro-F1 0.5557) on 20 scene tags by pooling each scene into 32 visual tokens, and the embedding head supports text-prompted retrieval, validated qualitatively. Code is available at: https://anonymous.4open.science/r/sceneminer_anonymous-64E5
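The zero-initialization idea in this abstract can be illustrated with a toy residual adapter: if a new sub-module's output projection starts at zero, the shared stream is unchanged at initialization, so a frozen sibling head sees exactly the same input. The sketch below is a hypothetical plain-Python illustration, not SceneMiner's code; `adapter`, `sibling_head`, and all weights are made-up names and values.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def adapter(x, W_in, W_out):
    """Hypothetical new sub-module on the shared stream: x + W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    delta = matvec(W_out, hidden)
    return [xi + di for xi, di in zip(x, delta)]

def sibling_head(x, W_head):
    """A frozen head that reads the shared stream."""
    return matvec(W_head, x)

stream = [0.5, -1.25, 2.0]
W_head = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]          # frozen weights
W_in = [[0.7, -0.2, 0.1], [0.3, 0.9, -0.5]]           # trainable input projection
W_out_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]     # zero-initialized output projection
W_out_rand = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]  # ordinary non-zero initialization

before = sibling_head(stream, W_head)
after_zero = sibling_head(adapter(stream, W_in, W_out_zero), W_head)
after_rand = sibling_head(adapter(stream, W_in, W_out_rand), W_head)

print(before == after_zero)  # zero init leaves the sibling head's output unchanged
print(before == after_rand)  # non-zero init shifts the shared stream
```

This only covers the state at initialization. Per the abstract, the method also freezes every parameter that feeds the shared stream so the heads stay preserved during training.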

2606.11505 2026-06-11 cs.CV cs.AI cs.CR New submission

On the Study of Biometric Spoofing Detection using Deep Learning

Kumar Kartikey, Nikos Komninos

Institution: School of Science and Technology

AI summary: Evaluates MobileNetV2, DenseNet-121, Inception-v3, and STD for spoofing detection in facial recognition systems. MobileNetV2 performs best at 92% accuracy and is suited to real-world use.

Abstract

Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of state-of-the-art machine learning models, namely MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD), in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 as the most effective model, achieving 92% accuracy while balancing computational efficiency, making it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.

2606.11499 2026-06-11 cs.CL cs.AI New submission

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Vedant Badoni, Danqi Chen, Xinyi Wang

Institutions: Princeton Language and Intelligence; Princeton University

AI summary: Proposes WebGraphMix, a framework that uses structural centrality scores from the Common Crawl host-level web graph to adjust the ratio of central to peripheral documents in the pretraining mix, with no model training or labeled data. In experiments at 400M and 1B parameter scales, a 1:1 mix reaches 41.4% on average, versus 39.8% for uniform sampling.

Comments: 10 pages

Abstract

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
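A minimal sketch of the central-versus-peripheral mixing described in this abstract. The abstract does not name the centrality measure, so in-degree is used here purely as an assumed stand-in, and the median split, the host names, and both function names are hypothetical.

```python
import random

def in_degree_centrality(edges):
    """Count inbound links per host; an assumed stand-in for the paper's centrality score."""
    score = {}
    for src, dst in edges:
        score.setdefault(src, 0)
        score[dst] = score.get(dst, 0) + 1
    return score

def mix_central_peripheral(docs_by_host, score, n_docs, central_frac=0.5, seed=0):
    """Sample n_docs documents, taking central_frac of them from the top-scoring half of hosts."""
    rng = random.Random(seed)
    hosts = sorted(docs_by_host, key=lambda h: score.get(h, 0), reverse=True)
    half = len(hosts) // 2
    central = [d for h in hosts[:half] for d in docs_by_host[h]]
    peripheral = [d for h in hosts[half:] for d in docs_by_host[h]]
    n_central = round(n_docs * central_frac)
    return rng.sample(central, n_central) + rng.sample(peripheral, n_docs - n_central)

# Toy host-level link graph: (source host, destination host).
edges = [
    ("a.example", "hub.example"), ("b.example", "hub.example"), ("c.example", "hub.example"),
    ("a.example", "wiki.example"), ("b.example", "wiki.example"), ("hub.example", "wiki.example"),
    ("c.example", "a.example"),
]
hosts = ["hub.example", "wiki.example", "a.example", "b.example", "c.example"]
docs_by_host = {h: [f"{h}/doc{i}" for i in range(3)] for h in hosts}

score = in_degree_centrality(edges)
mix = mix_central_peripheral(docs_by_host, score, n_docs=4)  # 1:1 central to peripheral
print(score)
print(mix)
```

Setting `central_frac` to other values varies the proportion of central documents, which is the knob the abstract describes. The scoring itself needs no labels or model training.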

2606.11490 2026-06-11 cs.LG cs.SY eess.SY New submission

OmniLoc: A Geometry-Aware Foundation Model for Anchor-Free UE Localization Across Diverse Indoor Environments

Lei Chu, Yuning Zhang, Omer Gokalp Serbetci, Anushka Katiyar, Bassel Abou Ali Modad, Andreas F. Molisch

Institution: University of California, Berkeley

AI summary: Proposes OmniLoc, described as the first foundation model built directly on wireless measurements for this task. It combines unified input tokenization, a geometry-aware Transformer, and a geometry-aware location estimation module to achieve robust anchor-free localization across indoor environments, and significantly outperforms existing methods.

Abstract

Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.

2606.11489 2026-06-11 cs.RO New submission

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

Satyajeet Das, Darren Chiu, Shashank Hegde, Gaurav S. Sukhatme

Institution: University of Southern California

AI summary: Proposes CLAE, a framework that steers multirobot behavior at inference time by editing the intermediate activations of a frozen policy, with no fine-tuning or retraining. It is validated on a multi-quadrotor navigation task with new behaviors including velocity control, formation keeping, and surveillance avoidance.

Abstract

Real-world robots need to adapt their behavior beyond the envelope of their pre-trained policy. Policy finetuning or retraining are options, but they risk catastrophic forgetting, degrading the pretrained policy's base performance. To combat this, we introduce CLAE: Closed-Loop Affine Activation Editing, an inference-time framework for steering the behavior of a frozen policy by editing intermediate activations while keeping the base policy weights and downstream action head untouched. CLAE approaches behavior steering as a closed-loop problem whose outputs edit policy activations that adapt online to the robot state, environment, target behavior, and multi-robot context. It trains a sparse autoencoder over frozen-policy activations, selects behavior-relevant latent features via post-hoc probing, and learns a lightweight RL-based steering policy that applies state-dependent affine edits to selected latents during inference. We validate CLAE on a frozen multi-quadrotor navigation policy trained to perform a single task: navigating robots to a set of goal locations while avoiding obstacles. Through extensive simulations and physical tests, we show that while navigating to their goal positions, CLAE can 1. steer individual robot behavior by controlling each robot's velocity profile; 2. coordinate multirobot behavior by preserving a desired formation; and 3. produce entirely new behavior wherein robots are required to reduce their exposure to surveillance cameras in the environment.
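The core operation named in this abstract, an affine edit on selected sparse-autoencoder latents, can be sketched as follows. This is a toy illustration under assumptions: the latent values, decoder matrix, selected indices, and the function names `affine_edit` and `decode` are all hypothetical, and in CLAE the scales and shifts would come from the learned steering policy as a function of state rather than being fixed constants.

```python
def affine_edit(latents, selected, scales, shifts):
    """Apply z_i <- a_i * z_i + b_i to the selected latent indices only."""
    edited = list(latents)
    for i, a, b in zip(selected, scales, shifts):
        edited[i] = a * latents[i] + b
    return edited

def decode(latents, decoder):
    """Map latents back to activation space; decoder has one row per activation unit."""
    return [sum(w * z for w, z in zip(row, latents)) for row in decoder]

# Toy numbers, chosen only for illustration.
latents = [0.0, 2.0, 1.0, 0.5]       # sparse code of one frozen-policy activation
decoder = [[1.0, 0.0, 0.5, 0.0],
           [0.0, 1.0, 0.0, 2.0]]
selected = [1, 3]                    # behavior-relevant features (hypothetical choice)

identity = affine_edit(latents, selected, scales=[1.0, 1.0], shifts=[0.0, 0.0])
steered = affine_edit(latents, selected, scales=[0.5, 1.0], shifts=[0.0, 0.25])

print(decode(identity, decoder))  # same activation as the unedited policy
print(decode(steered, decoder))   # edited activation passed on to the frozen action head
```

With scale 1 and shift 0 the edit is a no-op, so the base policy's behavior is recovered exactly, which is consistent with leaving the policy weights and action head untouched.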

2606.11480 2026-06-11 cs.LG New submission

Accurate and Resource-Efficient Federated Continual Learning

Jebacyril Arockiaraj, Dhruv Parikh, Jayashree Adivarahan, Rajgopal Kannan, Viktor Prasanna

Institutions: University of Southern California; DEVCOM Army Research Office

AI summary: Proposes FedRAN, a framework that replaces gradient updates with compact random-feature statistics, uses truncated SVD to cut communication overhead, and handles label scarcity with prototype-based pseudo-labels. It improves accuracy on several datasets while sharply reducing resource use.

Comments: Technical Report

Abstract

Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size $M$. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in $M$ for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling. Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6-121.8$\times$ less per-client communication than optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at https://github.com/JebacyrilArockiaraj/Fed-RAN-SSL.

2606.11477 2026-06-11 cs.CV cs.AI New submission

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Hartwig Grabowski

Institution: Institute for Machine Learning and Analytics (IMLA), Offenburg University

AI summary: Proposes using vision-language foundation models (VLMs) to recognize handwritten answers. On 61 exams (3141 answer positions), the best model reaches 98.4% accuracy, and a lightweight prompt lowers the false-negative rate to 0.58%, supporting fair, fully automated grading.

Comments: 11 pages, 2 figures, 3 tables

Abstract

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88% to 91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.
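The false-negative and false-positive distinction in this abstract can be made concrete with a small counting sketch. It assumes each answer position is a triple of the letter the student wrote, the letter the model recognized, and the reference letter, and it reports both rates as a fraction of all positions. The abstract does not state which denominator the paper uses, so that choice and the function name are assumptions.

```python
def grading_error_rates(records):
    """records: (written, recognized, reference) letters, one triple per answer position.

    False negative: the student wrote the correct letter but recognition marks it wrong.
    False positive: the student wrote a wrong letter but recognition marks it correct.
    Both rates are fractions of all answer positions (an assumed convention).
    """
    fn = sum(1 for w, r, ref in records if w == ref and r != ref)
    fp = sum(1 for w, r, ref in records if w != ref and r == ref)
    n = len(records)
    return fn / n, fp / n

records = [
    ("A", "A", "A"),  # correct answer, read correctly
    ("B", "D", "B"),  # correct answer misread: false negative
    ("C", "A", "A"),  # wrong answer misread as the solution: false positive
    ("D", "D", "B"),  # wrong answer, read correctly
]
fn_rate, fp_rate = grading_error_rates(records)
print(fn_rate, fp_rate)
```

Tracking the two rates separately matters because only false negatives disadvantage the student, which is why the abstract centres the evaluation on them.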

2606.11473 2026-06-11 cs.LG cs.AI stat.ML New submission

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Niraj Kumar

Institution: Global Technology Applied Research, JPMorganChase

AI summary: Proposes CRUMB, which speeds up prior-fitted network inference without retraining by clustering queries, selecting training subsets that minimise the maximum mean discrepancy, and then running exact inference. It outperforms comparable methods on 51 datasets.

Comments: 26 pages, 13 figures

Abstract

Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
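Stage (ii) of the pipeline in this abstract, greedy MMD minimisation, can be sketched for a single query cluster as below. This is an illustration under assumptions rather than the authors' implementation: it uses an RBF kernel with a fixed `gamma`, the biased squared-MMD estimate, and a naive greedy loop that recomputes the estimate for each candidate. The clustering and PFN inference stages are omitted.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel between two points given as tuples."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def mmd2(S, Q, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between point sets S and Q."""
    ss = sum(rbf(a, b, gamma) for a in S for b in S) / len(S) ** 2
    qq = sum(rbf(a, b, gamma) for a in Q for b in Q) / len(Q) ** 2
    sq = sum(rbf(a, b, gamma) for a in S for b in Q) / (len(S) * len(Q))
    return ss + qq - 2 * sq

def greedy_mmd_subset(train, queries, k, gamma=1.0):
    """Pick k training indices, each time adding the point that most lowers MMD to the queries."""
    chosen, remaining = [], list(range(len(train)))
    for _ in range(k):
        best = min(
            remaining,
            key=lambda i: mmd2([train[j] for j in chosen + [i]], queries, gamma),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

train = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
queries = [(4.9,), (5.1,), (5.3,)]          # one cluster of test queries
context = greedy_mmd_subset(train, queries, k=2)
print(sorted(context))  # indices of the training points nearest the query distribution
```

Because the subset is chosen to match the current query batch, a shifted batch would pull in different context points, which is in line with the abstract's remark about covariate drift.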

2606.11470 2026-06-11 cs.CL New submission

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah

Institutions: Singapore Institute of Technology; Nvidia AI Center (SNAIC); MIDAS Lab, IIIT Delhi; MIDAS Lab, IIT Mandi; Owl Autonomous Imaging, Inc.; College of Computing & Data Science, NTU Singapore; NVIDIA AI Technology Centre, Singapore; Department of Computer Science and Engineering, IIT Kanpur

AI summary: Systematically surveys more than 300 papers and proposes a structured taxonomy of LLM reasoning research that covers multiple reasoning paradigms, analyzes methodological trends, and summarizes common limitations and failure modes, as a reference for building more robust, interpretable, and generalizable reasoning systems.

Abstract

Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

2606.11466 2026-06-11 cs.CV New submission

PT-WNO: Point Transformer with Wavelet Neural Operator for 3D Point Cloud Semantic Segmentation

Nhut Le, Maryam Rahnemoonfar

Institution: Lehigh University

AI summary: Addresses insufficient global context in point cloud semantic segmentation with PT-WNO, which adds a learnable Wavelet Neural Operator branch alongside the skip connections to capture multi-scale global spectral context. It is evaluated on four benchmarks, with reported gains on S3DIS and DALES.

Abstract

Point cloud semantic segmentation requires architectures that capture both fine-grained local geometry and broad global scene structure. Transformer-based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder-decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene-level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT-WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder-decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi-scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large-scale 3D point cloud benchmarks demonstrate the effectiveness of PT-WNO. On S3DIS (Area 5), PT-WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT-WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).

2606.11464 2026-06-11 cs.RO New submission

Bridging the sim2real gap in the table tennis robot with a transformer-based ball states predictor

Yin Bi, Christian Conti, Bilan Yang, Alexander Sigrist, Peter Dürr, Naoya Takahashi

Institutions: Sony AI, Zürich, Switzerland; Sony AI, Tokyo, Japan

AI summary: Proposes a transformer-based framework for table tennis ball state prediction that uses attention to model long-range temporal dependencies and is trained on a large-scale real-world dataset. It also introduces the SPAD strategy, which swaps out the simulator at deployment to narrow the sim2real gap without retraining.

Abstract

Robotic table tennis is a representative benchmark for high-speed, closed-loop robotic control in dynamic environments, where accurate and fast prediction of ball states is critical for reliable planning and control. Physics-based approaches rely heavily on accurate parameter identification and precise initial state, while learning-based methods often struggle to capture long-range temporal dependencies and are typically trained on limited or simulated data. We propose a transformer-based framework for table tennis ball state prediction that leverages attention mechanisms to model long-range temporal correlations directly from historical observations, without relying on explicit flight or bounce models. To support robust learning and generalization, we collected a large-scale real-world dataset from players of varying skill levels and diverse ball cannon configurations. The combination of a high-capacity transformer architecture and extensive real-world data enables accurate long-horizon forecasting. Building on this capability, we introduce a plug-and-play sim-to-real transfer strategy, Swap Predictor at Deployment (SPAD), which replaces the physics-based simulator used during training with the proposed real-world-trained predictor at deployment, improving the sim-to-real transferability of the policy without requiring retraining. We demonstrate that this simple substitution effectively narrows the sim-to-real gap while preserving the efficiency and scalability of simulation-based training.

2606.11463 2026-06-11 cs.LG cs.AI New submission

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

Thomas Mbrice, Shashwat Panigrahi

Institution: Stony Brook University

AI summary: Targets the breakdown of traditional actuarial methods under climate change by proposing LSTM neural networks to detect structural breaks. Using Florida and Louisiana data, it hypothesizes a 15-20% improvement in reserve accuracy for catastrophe years and provides theoretical guarantees.

Comments: 15 pages, 0 figures, whitepaper YC

Abstract

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than Chain Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using 15-plus years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

2606.11459 2026-06-11 cs.CL cs.AI cs.LG New submission

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

Institutions: Google; UCLA

AI summary: Proposes APEX, a framework that dynamically stratifies data into Easy, Hard, and Mixed tiers and prioritizes high-leverage subsets to make prompt optimization more efficient under a fixed budget. Across three benchmarks, it improves on the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B.

Abstract

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
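One plausible reading of the Easy, Hard, and Mixed stratification in this abstract is sketched below: an example is Easy if every prompt in the lineage solved it, Hard if none did, and Mixed otherwise. The abstract does not give the exact rule, so this tiering criterion, the data layout, and the function name `stratify` are assumptions for illustration only.

```python
def stratify(results):
    """Assign each example a tier from its pass/fail record across the prompt lineage.

    results: {example_id: [bool, ...]} with one outcome per candidate prompt tried so far.
    """
    tiers = {}
    for example_id, outcomes in results.items():
        if all(outcomes):
            tiers[example_id] = "easy"    # every prompt succeeds: little signal
        elif not any(outcomes):
            tiers[example_id] = "hard"    # every prompt fails: little signal
        else:
            tiers[example_id] = "mixed"   # prompts disagree: informative for ranking
    return tiers

results = {
    "ex1": [True, True, True],
    "ex2": [False, False, False],
    "ex3": [True, False, True],
    "ex4": [False, True, False],
}
tiers = stratify(results)
mixed = [ex for ex, tier in tiers.items() if tier == "mixed"]
print(tiers)
print(mixed)  # the examples an optimizer would prioritize under a fixed evaluation budget
```

Re-running the stratification as new prompts join the lineage is what would make the tiers dynamic, since an example can move between tiers as outcomes accumulate.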


2606.11456 2026-06-11 cs.CL cs.AI cs.CY New submission

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

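Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

Institutions: University of Oxford; University of Zurich; Technical University of Munich

AI summary: Studies the methodological diversity and interpretive vulnerability of LLM agents in scientific analysis. Across 20 independent executions, the agents match or exceed human methodological diversity at the design layer but are sensitive to prompting at the verdict layer; the bias arises in interpretation rather than estimation.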

Abstract

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy question against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.


2606.11450 2026-06-11 cs.CV New submission

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

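Shengkai Sun, Zhiyong Cheng, Zefan Zhang, Jianfeng Dong, Zhihui Li, Meng Wang

Institutions: Hefei University of Technology; Jilin University; Zhejiang Gongshang University; University of Science and Technology of China

AI summary: Proposes Adaptive Masked Reconstruction (AMR), which decouples the decoder from the encoder and adds an adaptive guidance module, accelerating pre-training and improving downstream action-recognition accuracy. It surpasses existing methods on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD.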

Comments Accepted by CVPR2026. The code is available at https://github.com/AshenOne1005/AMR

Abstract

Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.


2606.11447 2026-06-11 cs.CL New submission

AI Coding Agents Can Reproduce Social Science Findings

Meysam Alizadeh, Mohsen Mosleh, Fabrizio Gilardi, Atoosa Kasirzadeh, Joshua Tucker

Institutions: University of Oxford; University of Zurich; Carnegie Mellon University; New York University

AI summary: Builds SocSci-Repro-Bench, a benchmark of 221 social science tasks, to evaluate the reproduction ability of two frontier coding agents, Claude Code and Codex. Both reproduce a large share of the findings, with Claude Code performing better; prompt framing can also nudge the agents toward confirmatory specification search.

Abstract

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains, constructed from studies whose results are either fully reproducible with available materials or demonstrably non-reproducible due to missing data, allowing us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task requiring identification of underlying research questions, and additional analyses suggest that results are not primarily driven by memorization. Providing the original paper PDF alongside replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.


2606.11446 2026-06-11 cs.CV cs.GR New submission

3D-CBM: A Framework for Concept-Based Interpretability in Generative 3D Modeling

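Ahmad Al-Kabbany

Institutions: Yubree Labs; Multimedia Interaction and Communication Lab, Arab Academy for Science and Technology

AI summary: Proposes incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures, mapping geometric inputs to a multi-tiered taxonomy of interpretable primitives and functional attributes to enable semantically steerable 3D generation. A proof-of-concept experiment reports 88.8% concept prediction accuracy and supports interactive correction of structural errors.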

Abstract

This research introduces a framework for incorporating Concept Bottleneck Models (CBMs) into 3D generative architectures to address the inherent 'semantic gap' in deep geometric learning. As deep models become central to 3D content creation, explainability shifts from a peripheral feature to a fundamental requirement for trust and accountability in safety-critical domains such as healthcare and manufacturing. CBMs provide an intrinsic interpretability solution by constraining latent representations to align with human-defined concepts, yet their application to unstructured 3D data remains largely unexplored. We design, implement, and validate a formal 3D-CBM architecture that maps raw geometric inputs, including point clouds and meshes, into a multi-tiered taxonomy of interpretable primitives and functional attributes. The framework further identifies strategic datasets, such as PartNet and ShapeNet, specialized for concept-based supervision. Experimental results from a 3D part-manipulation proof-of-concept experiment demonstrate the framework's efficacy, achieving a concept prediction accuracy of 88.8% and a Chamfer Distance of 0.0115. Critically, the model enables precise test-time intervention, allowing for the interactive correction of structural errors. This work establishes a foundation for semantically steerable 3D generation and invites further exploration into collaborative human-in-the-loop design systems.


2606.11445 2026-06-11 cs.AI New submission

Forecasting Future Behavior as a Learning Task

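Mosh Levy, Yoav Goldberg, Asa Cooper Stickland

Institutions: Bar-Ilan University; Allen Institute for AI; UK AI Security Institute

AI summary: Proposes treating AI behavior forecasting as a learnable task. Behavior Forecasters are trained to predict future behavior from a reasoning trajectory, bypassing the explanation step, and are more accurate than GPT-5.4 and Claude Opus-4.6 on two tasks.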

Abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.


2606.11440 2026-06-11 cs.AI New submission

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

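Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

Institutions: University of Central Florida

AI summary: Proposes INFRAMIND, a framework that uses reinforcement learning to fold infrastructure state (queue depths, KV-cache pressure, latencies) into the planning, routing, and scheduling decisions of multi-agent LLM orchestration, balancing quality against latency on shared GPU clusters. It delivers up to +7.6 pp accuracy over the prior baseline with up to 7x lower latency.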

Comments Preprint

Abstract

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
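To make the routing and scheduling decisions concrete, here is a minimal hand-written sketch. It is not the paper's method: INFRAMIND learns these decisions with reinforcement learning, whereas the sketch substitutes two fixed heuristics (route to the model with the lowest estimated wait, serve the request with the least remaining slack first). The `Request` type, the `queue_depth` and `latency_s` fields, and the wait estimate are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(order=True)
class Request:
    slack_s: float                          # remaining latency budget; smaller = more urgent
    request_id: str = field(compare=False)  # excluded from ordering


def route(models: Dict[str, dict]) -> str:
    """Pick the candidate model with the lowest estimated wait (heuristic)."""
    def est_wait(stats: dict) -> float:
        return stats["queue_depth"] * stats["latency_s"]
    return min(models, key=lambda name: est_wait(models[name]))


def schedule(queue: List[Request]) -> List[str]:
    """Return request ids in serving order: least remaining slack first."""
    heap = list(queue)
    heapq.heapify(heap)
    return [heapq.heappop(heap).request_id for _ in range(len(heap))]
```

Even this fixed rule avoids the failure described in the abstract, where a preferred model accumulates a deep queue while an equally capable alternative sits idle.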


2606.11435 2026-06-11 cs.CL New submission

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

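Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas

Institutions: Rutgers University; University of North Carolina at Charlotte

AI summary: A systematic survey of the shift in agent skills from isolated creation to automated, evaluation-driven evolution. It categorizes four evolution paradigms, analyzes six skill-centric benchmark categories, and identifies coverage gaps and open directions.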

Abstract

The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing an emerging paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms, spanning execution feedback, trajectory distillation, compression, and reinforcement learning, showing how each element contributes to improving skill utility and reliability. We also provide an analysis of six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is https://github.com/Cassie07/AgentSkill_Survey


2606.11431 2026-06-11 cs.LG New submission

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren

Institutions: Blavatnik School of Computer Science and AI, Tel Aviv University; Google Research

AI summary: Proves that Mirror Descent (MD) with a non-quadratic regularizer can be exponentially more sensitive to initialization on convex smooth objectives, in sharp contrast to Gradient Descent (GD), and shows that Bregman regularization toward an anchor point can mitigate the instability.

Abstract

Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
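The near-boundary effect for KL-regularized MD on the simplex can be seen in a toy run. The sketch below is an illustration, not the paper's construction: it uses the standard exponentiated-gradient form of the update on a two-dimensional simplex, with a linear objective, a step size, and a starting point chosen purely for demonstration.

```python
import math
from typing import List


def kl_md_step(x: List[float], grad: List[float], eta: float) -> List[float]:
    """One KL-regularized mirror descent step on the simplex.

    This is the exponentiated-gradient update: multiply each coordinate
    by exp(-eta * gradient) and renormalize.
    """
    weights = [xi * math.exp(-eta * gi) for xi, gi in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run(x0: List[float], grad: List[float], eta: float, steps: int) -> List[float]:
    x = list(x0)
    for _ in range(steps):
        x = kl_md_step(x, grad, eta)
    return x


def l1(a: List[float], b: List[float]) -> float:
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


# Linear objective <c, x>: the gradient is the constant vector c.
c = [1.0, 0.0]
eps = 1e-6
x0 = [1 - eps, eps]                 # start near the boundary of the simplex
x0_pert = [1 - 2 * eps, 2 * eps]    # same start, perturbed by eps per coordinate

before = l1(x0, x0_pert)
after = l1(run(x0, c, eta=1.0, steps=5), run(x0_pert, c, eta=1.0, steps=5))
```

In this toy setting the gap between the two trajectories grows by roughly a factor of `exp(eta * steps)` while the iterates remain near the boundary, which is the flavor of amplification the abstract describes for linear objectives.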


2606.11424 2026-06-11 cs.CL New submission

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Sai Ashish Somayajula, Marianne Menglin Liu, Chuan Lei, Fjona Parllaku, Daniel Garcia, Rongguang Wang, Syed Fahad Allam Shah, Ankan Bansal, Sujeeth Bharadwaj, Tao Sheng, Sujith Ravi, Dan Roth

Institutions: Oracle AI

AI summary: Proposes SOMA-SQL, a framework that automatically resolves multi-source ambiguity in NL-to-SQL through synthetic query logs and ambiguity-driven probing, improving execution accuracy by 13.0% on average across six benchmarks.

Comments 34 pages, 1 figure, 7 tables. Preprint

Abstract

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations is a central failure mode in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via a targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.


2606.11420 2026-06-11 cs.CL cs.SI New submission

Context-Aware Multimodal Claim Verification in Spoken Dialogues

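Chaewan Chun, Delvin Ce Zhang, Dongwon Lee

Institutions: The Pennsylvania State University; University of Sheffield

AI summary: Introduces the MAD2 benchmark and a context-aware multimodal fusion method for verifying claims in dialogue audio, and finds that conversational structure matters more for verification than misinformation framing.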

Abstract

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.


2606.11419 2026-06-11 cs.RO New submission

A Modular Dual-Camera Pipeline for Micro-Inspection Using Aerial Robots

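S. H. Mirtajadini, N. Rublein, R. M. Ramakrishnan, G. ter Maat, M. Aldibaja, A. Y. Mersha

Institutions: Netherlands Organization for Scientific Research (NWO); Saxion University of Applied Sciences

AI summary: Proposes a modular dual-camera pipeline for aerial micro-inspection in which a zoomed, gimbal-mounted camera and a wide-field-of-view stereo navigation camera work together with a vision-based feedback loop, enabling robust micro-inspection of non-structural targets such as trees and greenhouse sticky traps.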

Abstract

Most existing drone-based inspection systems require the drone to fly dangerously close to the target or follow complex flight paths to capture small details. In addition, drone flight is affected by disturbances and localization inaccuracies, which can cause the drone to lose sight of its supposed target when it has a narrow view. Furthermore, trajectory planning often requires prior information about the target's geometry, position, and orientation, which is not always available for non-structural targets such as trees, vehicles, or people. To address these challenges, this paper presents aerial_micro_inspection, a generic pipeline for aerial micro-inspection across different use cases. The pipeline assumes a PX4-powered drone equipped with two cameras: (i) a zoomed, gimbal-mounted inspection camera that captures fine details without requiring the drone to fly very close to the target, and (ii) a wide-field-of-view stereo navigation camera that acquires the target surface on site, estimates its range, and partitions it into smaller inspection regions. In addition, a vision-based feedback loop compensates for drone motion while the inspection camera visits small partitions of a larger surface. We evaluate the pipeline in simulation and real-world experiments, mainly in two use-case scenarios: tree inspection for detecting oak processionary caterpillars and their eggs, and greenhouse inspection of sticky traps for detecting whiteflies. The results show improved coverage robustness under drone disturbances in simulation, as well as effective detection of caterpillars and eggs and high-detail imaging of insects in real-world experiments. The pipeline is open-source, developed in ROS 2, and can be adapted to new applications by replacing the surface-segmentation and micro-target detection checkpoints. The code is available at: https://github.com/SaxionMechatronics/aerial_micro_inspection


2606.11417 2026-06-11 cs.LG cs.AI stat.ML New submission

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

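Ayush Mittal, Dhruv Gupta

Institutions: GitHub

AI summary: Proposes signed compression progress as an intrinsic motivation signal, proving that cumulative reward equals audit improvement and that finite audit panels admit a sharp false-positive budget, which makes the signal resistant to Goodhart's law.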

Comments 16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: https://github.com/Zetetic-Dhruv/audit-compression-progress

Abstract

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
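The telescoping argument and the clipping failure mode are easy to check numerically. The sketch below is an illustration with made-up audit losses; the function names are hypothetical, and the clipped variant `max(r_t, 0)` is one assumed reading of "progress is clipped".

```python
from typing import List


def signed_rewards(audit_losses: List[float]) -> List[float]:
    """r_t = E(theta_{t-1}) - E(theta_t) for a sequence of sealed-audit losses."""
    return [prev - cur for prev, cur in zip(audit_losses, audit_losses[1:])]


def clipped_rewards(audit_losses: List[float]) -> List[float]:
    """Clipped variant max(r_t, 0): regressions are never charged back."""
    return [max(r, 0.0) for r in signed_rewards(audit_losses)]


# An agent whose audit loss oscillates with no net improvement.
losses = [1.0, 0.5, 1.0, 0.5, 1.0]

signed_total = sum(signed_rewards(losses))    # telescopes to losses[0] - losses[-1]
clipped_total = sum(clipped_rewards(losses))  # pays for every downswing
```

With signed rewards the oscillating agent earns exactly the endpoint improvement, which is zero here. With clipping it collects reward on every downswing, so the total grows with the number of oscillations even though the audit loss never improves.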


2606.11409 2026-06-11 cs.LG cs.AI cs.CR New submission

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

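Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel

Institutions: University of Toronto; Vector Institute; Hugging Face

AI summary: Proposes an adversarial-robustness evaluation framework based on computational pressure (cumulative FLOPs). Using risk-compute curves and two new metrics, it exposes differences in the compute cost of attack strategies and, across ten models, finds that alignment training has non-monotonic effects on compute-space robustness while model scale mainly blunts gradient-based attacks.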

Abstract

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
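A risk-compute curve can be sketched as an empirical success fraction at each compute budget. This is an assumed reading of "map compute budgets to attack risk", not the released framework: the function names and the input format (cumulative FLOPs at first success, or `None` for attempts that never succeed) are hypothetical, and the paper's two summary metrics are not specified in the abstract, so they are not reproduced here.

```python
from typing import List, Optional, Tuple


def risk_at_budget(flops_to_success: List[Optional[float]], budget: float) -> float:
    """Fraction of attack attempts that succeed within `budget` cumulative FLOPs.

    Each entry is the cumulative FLOPs spent when the attempt first
    succeeded, or None if it never succeeded.
    """
    if not flops_to_success:
        raise ValueError("need at least one attempt")
    hits = sum(1 for f in flops_to_success if f is not None and f <= budget)
    return hits / len(flops_to_success)


def risk_compute_curve(
    flops_to_success: List[Optional[float]], budgets: List[float]
) -> List[Tuple[float, float]]:
    """Evaluate the risk at each budget, in increasing budget order."""
    return [(b, risk_at_budget(flops_to_success, b)) for b in sorted(budgets)]
```

Plotting the returned pairs on a logarithmic compute axis would let a cheap template-based attack and an expensive gradient-based attack be compared at equal compute rather than at equal query count.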

2606.11408 2026-06-11 cs.RO New submission

Dynamic Execution Horizon Prediction for Chunk-based Robot Policies

Yuchi Zhao, Miroslav Bogdanovic, Arjun Sohal, Liyu Tao, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, Animesh Garg

Affiliations: University of Toronto; Vector Institute for Artificial Intelligence; Acceleration Consortium; Canadian Institute for Advanced Research (CIFAR); Georgia Institute of Technology; NVIDIA

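AI summary: Proposes DEHP, which trains a lightweight execution-horizon prediction branch with online reinforcement learning and dynamically adjusts the number of executed steps while the pretrained chunk policy stays frozen, substantially improving success rates on high-precision and long-horizon manipulation tasks.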

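AI-translated abstract

Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models: the policy predicts a sequence of actions and executes a fixed number of them rather than acting one step at a time. This paradigm, however, rests on a key assumption, namely a fixed execution horizon. During chunk execution the policy runs open-loop, which is especially problematic for fine-grained manipulation tasks that need frequent replanning. In practice, the execution horizon is usually chosen by empirical tuning and is highly task-dependent. To address this, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch with online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. In our evaluations, DEHP substantially improves success rates on a range of high-precision and long-horizon manipulation tasks. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during the fine-grained stages of a task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution against the reactivity of closed-loop single-step control. Project page: https://dehp-chunking.github.io/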

English abstract

Action chunking has become a standard design in modern robot policies, from diffusion/flow policies to vision-language-action models, where the policy predicts a sequence of actions and executes a fixed number of them instead of acting one step at a time. However, this paradigm relies on a key assumption: a fixed execution horizon. During chunk execution, the policy operates open-loop, which is particularly problematic for fine-grained manipulation tasks that require frequent replanning. In practice, the execution horizon is typically chosen through empirical tuning and is highly task-dependent. To this end, we propose Dynamic Execution Horizon Prediction (DEHP), an effective method that trains a lightweight execution-horizon prediction branch using online reinforcement learning while keeping the pretrained chunk policy completely frozen. This makes the method compatible with black-box chunk policies and isolates the effect of adapting the execution horizon from changes to the underlying action generator. Across our evaluations, DEHP improves the success rate of different high-precision and long-horizon manipulation tasks by a large margin. Our qualitative analysis further shows that DEHP predicts shorter execution horizons during fine-grained stages of the task and longer horizons during free-space motion. In this way, DEHP balances the efficiency of open-loop chunk execution with the reactivity of closed-loop single-step control. Project page: https://dehp-chunking.github.io/
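The execution loop the abstract describes can be sketched as follows. It assumes a frozen chunk policy that returns a list of actions and a separate horizon predictor that chooses how many of them to execute before replanning. All names (`run_episode`, `toy_chunk_policy`, `toy_horizon_predictor`) and the 1-D toy task are hypothetical; the online reinforcement learning that trains the predictor in DEHP is not shown, and a hand-written rule stands in for it.

```python
from typing import Callable, List, Tuple


def run_episode(
    chunk_policy: Callable[[float], List[float]],
    horizon_predictor: Callable[[float, List[float]], int],
    step: Callable[[float, float], float],
    obs: float,
    max_steps: int,
) -> Tuple[float, List[int]]:
    """Execute a frozen chunk policy with a per-chunk, state-dependent execution horizon."""
    executed = 0
    horizons: List[int] = []
    while executed < max_steps:
        chunk = chunk_policy(obs)  # frozen action generator, never updated here
        k = horizon_predictor(obs, chunk)
        # Clamp to a valid horizon: at least 1, at most the chunk length and remaining steps.
        k = max(1, min(k, len(chunk), max_steps - executed))
        for action in chunk[:k]:  # open-loop execution of the first k actions
            obs = step(obs, action)
        executed += k
        horizons.append(k)  # replan from the new observation on the next iteration
    return obs, horizons


GOAL = 10.0  # hypothetical 1-D target position


def toy_chunk_policy(obs: float) -> List[float]:
    """Stand-in for a pretrained policy: always proposes four unit moves."""
    return [1.0, 1.0, 1.0, 1.0]


def toy_horizon_predictor(obs: float, chunk: List[float]) -> int:
    """Stand-in for the learned branch: short horizon near the goal, long otherwise."""
    return 1 if abs(GOAL - obs) < 3.0 else len(chunk)


final_obs, horizons = run_episode(
    toy_chunk_policy, toy_horizon_predictor, lambda o, a: o + a, obs=0.0, max_steps=10
)
print(final_obs, horizons)  # 10.0 [4, 4, 1, 1]
```

Because the loop only calls the chunk policy and never updates it, the same wrapper would apply to a black-box policy, which is the compatibility property the abstract highlights.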

2606.11400 2026-06-11 cs.SD cs.AI eess.AS New submission

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Tsung-En Lin, Hung-Yi Lee

Affiliations: National Taiwan University; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)

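AI summary: Proposes instruction-based vector steering, which contrasts activations under different instructions to redirect temporal attention over audio tokens, enabling training-free sound event localization that clearly outperforms direct prompting and random baselines.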

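AI-translated abstract

Large audio-language models (LALMs) perform well at audio understanding but reveal little about which part of the audio signal they attend to. We introduce instruction-based vector steering, which builds a steering vector by contrasting activations under different instruction prompts while keeping the audio unchanged. Through systematic probing of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention assigned to audio tokens and concentrates it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of the largest steering-induced attention change recovers the location of the queried sound event without any training, reaching 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3 respectively, well above direct prompting (31.84%, 46.75%) and the random baseline (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe of the latent temporal structure these models encode.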

English abstract

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that - unlike standard prompting or audio-based steering - this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
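A minimal sketch of the two steps named above, assuming activations and attention weights have already been extracted from the model as plain lists. The difference-of-means construction, the scale `alpha`, and how attention is aggregated over heads and layers are assumptions for illustration; the abstract does not specify them, and all function names are hypothetical.

```python
from typing import List, Sequence


def mean_rows(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(
    acts_instructed: Sequence[Sequence[float]],
    acts_reference: Sequence[Sequence[float]],
) -> List[float]:
    """Contrast activations from two instruction prompts recorded on the same audio."""
    return [a - b for a, b in zip(mean_rows(acts_instructed), mean_rows(acts_reference))]


def apply_steering(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Add the scaled steering vector to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def locate_event(attn_base: Sequence[float], attn_steered: Sequence[float]) -> int:
    """Index of the audio token whose attention increases most under steering."""
    deltas = [s - b for b, s in zip(attn_base, attn_steered)]
    return max(range(len(deltas)), key=deltas.__getitem__)


# Hypothetical toy values: 2-D activations and attention over four audio tokens.
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered_hidden = apply_steering([0.5, 0.5], vec, alpha=2.0)
position = locate_event([0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.6, 0.1])
print(vec, steered_hidden, position)  # [1.0, 2.0] [2.5, 4.5] 2
```

Mapping the returned token index back to a time interval depends on the audio encoder's frame rate, which is left out of this sketch.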

2606.11399 2026-06-11 cs.CL New submission

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

Trung Duc Anh Dang, Tung Kieu, Sarah Masud

Affiliations: University of Copenhagen; Aalborg University

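AI summary: Proposes scenario-based behavioral dilemmas to probe and adjust the latent values of LLMs along the Inglehart-Welzel cultural axes using token-level probabilities and activation steering, and finds coupling effects between steering along different cultural dimensions.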

Comments: 18 pages

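AI-translated abstract

Large language models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from their training data. Evaluations of cultural alignment usually rely on direct prompting with survey-style questions, which often elicits neutral or safety-aligned responses and fails to capture the model's underlying preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart-Welzel axes of the World Values Survey (WVS). By converting social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to change model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial differences in steerability and identify latent entanglement, in which an intervention along one cultural dimension induces shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, although general task performance is largely preserved.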

English abstract

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart-Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
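One way to read "token-level probabilities" for a two-option dilemma is sketched below. It assumes the model's next-token logits for the two option labels have already been extracted, and it scores the lean toward one pole as the difference of the renormalized probabilities. The option labels, the scoring rule, and the function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def option_probabilities(option_logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax restricted to the answer-option tokens of one dilemma."""
    top = max(option_logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in option_logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def value_score(option_logits: Dict[str, float], pole_a: str = "A", pole_b: str = "B") -> float:
    """Score in [-1, 1]: positive leans toward the option tied to pole A, negative toward pole B."""
    probs = option_probabilities(option_logits)
    return probs[pole_a] - probs[pole_b]


# Hypothetical logits for one dilemma where option "A" maps to one end of a cultural axis.
neutral = value_score({"A": 0.0, "B": 0.0})
leaning = value_score({"A": math.log(3.0), "B": 0.0})
print(neutral, round(leaning, 6))
```

Averaging such scores over many dilemmas per axis, before and after an intervention, would give one simple way to observe whether steering along one axis also moves the other.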
