2606.14999 2026-06-16 cs.LG New submission

Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

Monika Choudhary, Xiaoya Chong, Runbo Jiang, Wiebke Koepp, Petrus H. Zwart, Damon English, Gregory M. Su, Eric Schaible, Chenhui Zhu, Mostafa Nassr, Noah P. Wamble, Kelvin Kam-Yun Li, Jonathan M. Chan, Jose Carlos Diaz, Cameron McKay, Lynn Katz, Benny Freeman, Guillaume Freychet, Yevgen Matviychuk, Eliot Gann, Daniel B. Allan, Benedikt Sochor, Frank Schluenzen, Stephan V. Roth, Ethan Crumlin, Dylan McReynolds, Tanny Chavez, Alexander Hexemer

Affiliations Advanced Light Source, Lawrence Berkeley National Laboratory; Center for Advanced Mathematics for Energy Research Applications, Lawrence Berkeley National Laboratory; Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory; Berkeley Synchrotron Infrared Structural Biology program, Lawrence Berkeley National Laboratory; Materials Sciences Division, Lawrence Berkeley National Laboratory; McKetta Department of Chemical Engineering, University of Texas

AI summary Targets two challenges for X-ray scattering data, offline exploration and real-time analysis, by training a domain-specific attention-based convolutional variational autoencoder (C-VAE) that learns low-dimensional representations capturing structural variation. The model is integrated into Latent Space Explorer on the MLExchange platform to support interactive structural exploration.

Abstract

Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images to learn low-dimensional representations capturing structural variation across diverse experimental conditions. The learned latent space reveals well-organized clusters and smooth trajectories reflecting experimental progression. It further supports controlled synthetic scattering image generation across diverse structural states. When deployed without retraining, the model organizes time-resolved film formation experiments at two synchrotron facilities into interpretable latent structures. Benchmarking against DINOv3 (ViT-7B), a general-purpose vision foundation model, demonstrates that domain-specific training yields more interpretable latent organization for scattering data. Both workflows are integrated within Latent Space Explorer, a component of the MLExchange platform, supporting interactive structural exploration across archived datasets and live experiments.
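
As a rough illustration of the kind of objective a variational autoencoder optimizes, the sketch below computes the closed-form KL term for a diagonal Gaussian posterior and draws a latent sample with the reparameterization trick. These are generic textbook VAE ingredients, not the authors' attention-based C-VAE; the function names and toy values are assumptions made here for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


rng = random.Random(0)  # fixed seed so the sample is reproducible
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)

# A posterior equal to the prior costs nothing; shifting the mean by 1 costs 0.5.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

In a full model this KL term would be added to a reconstruction loss over the scattering images; how the paper weights the two terms is not stated in the abstract.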

2606.14997 2026-06-16 cs.AI cs.LG New submission

AI Engram: In Search of Memory Traces in Artificial Intelligence

Jea Kwon, Dong-Kyum Kim, Jiwon Kim, Yonghyun Kim, Woong Kook, Meeyoung Cha

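Affiliations University of California, Berkeley; KAIST

AI summary Proposes a geometric framework that formalizes neuroscientific criteria as a constrained inverse problem to identify memory traces (AI engrams) in deep neural networks, enabling linear composition and erasure of memories without iterative optimization.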

Comments Accepted to ICML 2026 (Oral). Code is available at https://github.com/jeakwon/ai-engram/

Abstract

Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such "AI engrams" by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

2606.14981 2026-06-16 cs.RO cs.AI cs.LG New submission

Inference-time Policy Steering via Vision and Touch

Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy

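Affiliations Carnegie Mellon University

AI summary Proposes ViTaL, a framework that steers robot policies at inference time through a bi-level optimization combining visual sampling-and-verification with tactile-guided diffusion editing, significantly improving success rates on contact-rich manipulation tasks.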

Abstract

Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

2606.14972 2026-06-16 cs.CV New submission

ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

Adam Sun, Eshaan Barkataki, Arnold Milstein, Gordon Wetzstein, Ehsan Adeli

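Affiliations Stanford University

AI summary Proposes ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. It adopts a "regenerate, don't edit" paradigm and fine-tunes a video diffusion model on structural conditioning to synthesize the human regions.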

Abstract

Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. Unlike past approaches, which redact or edit the inputs directly, we propose a "regenerate, don't edit" paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams, StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

2606.14971 2026-06-16 cs.LG cs.AI New submission

FastMix: Fast Data Mixture Optimization via Gradient Descent

Haoru Tan, Sitong Wu, Yanfeng Chen, Jun Xia, Ruobing Xie, Bin Xia, Xingwu Sun, Xiaojuan Qi

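Affiliations University of Hong Kong; Tencent; Chinese University of Hong Kong

AI summary Proposes FastMix, a framework that recasts data mixture selection as a bilevel optimization problem and jointly optimizes mixture coefficients and model parameters for efficient, scalable mixture discovery. It outperforms baselines in both pre-training and post-training while sharply reducing search cost.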

Journal ref ICLR-2026

Abstract

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code: https://github.com/hrtan/fastmix
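
The equivalence between mixture ratios and per-source loss weights can be checked on a toy example. The sketch below only verifies the underlying expectation identity (sampling source k with probability p_k gives the same expected loss as sampling uniformly and weighting source k by K * p_k); it is not the FASTMIX algorithm, and the function names and numbers are assumptions chosen for illustration.

```python
def mixture_loss(per_source_losses, ratios):
    """Expected loss when source k is sampled with probability ratios[k]."""
    return sum(p * loss for p, loss in zip(ratios, per_source_losses))


def weighted_uniform_loss(per_source_losses, ratios):
    """Same expectation under uniform source sampling with weights K * p_k."""
    k = len(per_source_losses)
    weights = [k * p for p in ratios]
    return sum(w * loss for w, loss in zip(weights, per_source_losses)) / k


# Hypothetical per-source losses and mixture ratios (ratios sum to 1).
losses = [2.0, 0.5, 1.0]
ratios = [0.2, 0.5, 0.3]

sampled = mixture_loss(losses, ratios)
reweighted = weighted_uniform_loss(losses, ratios)
```

Because the ratios enter the reweighted form as ordinary multiplicative weights, they can in principle be differentiated like any other parameter, which is the property the abstract relies on.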

2606.14970 2026-06-16 cs.LG New submission

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Dmitriy Bystrov, Daniil Medyakov, Dmitry Bylinkin, Aleksandr Beznosikov

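Affiliations University of California, Berkeley; Stanford University

AI summary Addresses the high memory overhead of backpropagation in large-model fine-tuning and the sensitivity of zeroth-order optimization to the stepsize and smoothing parameter by proposing AdaNAGED, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. The method is validated on the OPT-1.3B model.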

Comments 29 pages, 1 table

Abstract

Fine-tuning large language models (LLMs) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain-specific data. A major obstacle in large-scale fine-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states. Zeroth-order (ZO) optimization offers a memory-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task-specific tuning. Parameter-free (PF) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem-dependent constants. Moreover, large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit a linear minimization oracle (LMO). In this work, we study PF adaptation for LMO-based ZO optimization and introduce AdaNAGED, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. We establish convergence guarantees and validate the method on a large-scale LLM fine-tuning task with the OPT-1.3B model.
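
To make the roles of the stepsize and smoothing parameter concrete, here is a minimal sketch of a standard two-point zeroth-order gradient estimate driving plain gradient descent on a toy quadratic. It is not AdaNAGED: there is no parameter-free adaptation and no LMO step, and the fixed `stepsize` and `smoothing` values are assumptions picked by hand, which is exactly the tuning burden the abstract describes.

```python
import random


def zo_gradient(f, x, smoothing, rng):
    """Two-point estimate: ((f(x + s*u) - f(x - s*u)) / (2*s)) * u, with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + smoothing * ui for xi, ui in zip(x, u)]
    x_minus = [xi - smoothing * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * smoothing)
    return [scale * ui for ui in u]


def zo_descent(f, x0, stepsize=0.05, smoothing=1e-3, steps=500, seed=0):
    """Gradient descent that only ever evaluates f, never its gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, smoothing, rng)
        x = [xi - stepsize * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    return sum(xi * xi for xi in x)


start = [1.0, -2.0, 0.5]
final = zo_descent(quadratic, start)
```

Only function values are needed, so no activations or gradients have to be stored; the price is a noisy estimate whose quality depends on the two hand-set constants.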

2606.14969 2026-06-16 cs.RO New submission

Multimodal Physiological Assessment of Contact-Rich Physical Human-Robot Interaction Under Varying Environmental Conditions

Yanyi Chen, Xi Wang, Min Deng

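Affiliations Texas Tech University; Texas A&M University; University of Tennessee, Knoxville

AI summary Using multimodal physiological measurements (EDA, sEMG, eye tracking) and subjective comfort ratings, this study reveals the hidden physiological cost that environmental stressors (temperature, noise, illuminance) impose on operators in contact-rich physical human-robot interaction, and identifies a compensatory mechanism in which operators sustain task performance by increasing physiological effort.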

Abstract

Physical human-robot interaction (pHRI) in real-world settings exposes operators to fluctuating environmental conditions during contact-rich tasks. Traditional task-centric evaluations overlook the physiological burdens imposed by these stressors. Therefore, we conducted a multimodal empirical study involving contact-rich tracing tasks under 18 distinct combinations of temperature, acoustic noise, and illuminance. Synchronously, we recorded electrodermal activity (EDA), surface electromyography (sEMG), eye-tracking data, and subjective environmental comfort ratings. Evaluating these physiological signals alongside execution data revealed hidden physiological costs not captured by objective performance. The results revealed that task performance remained stable across all environmental conditions. Autonomic workload, indexed by tonic skin conductance level (SCL), increased with temperature, while physical and cognitive workload were unaffected. Perceived environmental comfort showed no significant association with tracing error or completion time. These findings reveal a compensatory mechanism where operators maintain consistent performance by increasing their physiological effort to suppress thermal discomfort. Such insight motivates the development of physiology-aware control architectures that leverage real-time physiological metrics to reduce operator workload in unstructured environments.

2606.14965 2026-06-16 cs.LG cs.DB New submission

Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

Shadman Islam, Agustinus Kristiadi, Mostafa Milani

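Affiliations University of California, Berkeley

AI summary Proposes CILN, a framework that generates instance-dependent label noise through controlled input corruptions, builds 90 benchmark settings, and shows that noise structure strongly affects algorithm behavior.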

Comments 12-page conference submission

Abstract

Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the source of ambiguity implicit. We introduce CILN, a benchmark generation framework that creates IDN through controlled input corruptions. A diverse voter pool labels corrupted instances, producing benchmark datasets in which both the source and severity of ambiguity are explicit and controllable. Using CIFAR-10, MNIST, and Adult, we construct 90 benchmark settings spanning multiple corruption families and severity levels. Our experiments show that the resulting benchmarks exhibit genuine instance-dependent noise, provide diverse confusion structures, and, on CIFAR-10, can produce label distributions that are closer to human uncertainty than an existing synthetic IDN benchmark. We further demonstrate that corruption-mediated IDN can expose failure modes of popular noisy-label learning methods, including Co-Teaching and DivideMix, that are not observed under comparable levels of rater-fallibility noise. These findings suggest that noise structure, not only noise rate, plays an important role in benchmark difficulty and algorithm behavior. By making ambiguity generation explicit and controllable, CILN provides a complementary benchmarking framework for studying noisy-label learning under diverse sources of instance difficulty.
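
The corrupt-then-vote idea can be sketched on a one-dimensional toy problem. Everything below is an illustrative assumption rather than CILN itself: the Gaussian corruption, the threshold "voters", and the majority rule are stand-ins, since the abstract does not specify the corruption families or how the voter pool is built.

```python
import random
from collections import Counter


def corrupt(x, severity, rng):
    """Additive Gaussian noise; severity is the noise standard deviation."""
    return x + rng.gauss(0.0, severity)


def make_voters(thresholds):
    """Each hypothetical voter is a threshold classifier with its own cut-off."""
    return [lambda x, t=t: int(x > t) for t in thresholds]


def noisy_label(x, severity, voters, rng):
    """Label a corrupted copy of x by majority vote; also return the vote shares."""
    x_corrupted = corrupt(x, severity, rng)
    votes = Counter(voter(x_corrupted) for voter in voters)
    total = sum(votes.values())
    shares = {label: count / total for label, count in votes.items()}
    return max(shares, key=shares.get), shares


rng = random.Random(0)
voters = make_voters([-0.2, 0.0, 0.2])

# An easy instance far from every threshold versus an ambiguous one near them.
easy_label, easy_shares = noisy_label(5.0, 0.0, voters, rng)
hard_label, hard_shares = noisy_label(0.1, 0.0, voters, rng)
```

The vote shares act as a soft label whose spread depends on where the (corrupted) instance sits, which is what makes the noise instance-dependent; raising `severity` makes disagreement more likely.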

2606.14963 2026-06-16 cs.CV cs.AI New submission

Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

Affiliations Built Environment Department, College of Science and Technology, North Carolina A&T State University; United Nations University Institute for Water, Environment and Health

AI summary Proposes a deep learning framework that fuses bi-temporal remote sensing imagery through a multi-modal attention mechanism to classify building damage into four levels (none, minor, major, destroyed), reaching 94.90% accuracy.

Comments This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

Abstract

Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.

2606.14961 2026-06-16 cs.CL New submission

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Juming Xiong, Weixin Liu, Kevin Guo, Congning Ni, Junchao Zhu, Chongyu Qu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin

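Affiliations Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary Proposes a GRPO-based reinforcement learning framework that jointly rewards answer correctness, confidence, and rationale support, reducing confidence-rationale alignment error and improving reasoning reliability.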

Abstract

Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
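
One way to picture a joint reward of this kind is a weighted sum of the three signals, followed by the group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt. The sketch below assumes that form: the weights, the function names, and the additive combination are hypothetical, because the abstract does not say how the three terms are combined.

```python
from statistics import mean, pstdev


def joint_reward(correct, answer_prob, rubric_score, weights=(1.0, 0.5, 0.5)):
    """Hypothetical weighted sum of correctness, confidence, and rationale support."""
    w_correct, w_prob, w_rubric = weights
    return w_correct * float(correct) + w_prob * answer_prob + w_rubric * rubric_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three completions for one prompt, all equally confident (0.9) in their answer:
# correct with a well-supported rationale, correct with a weak one, and incorrect.
rewards = [
    joint_reward(True, 0.9, 0.8),
    joint_reward(True, 0.9, 0.2),
    joint_reward(False, 0.9, 0.1),
]
advantages = group_relative_advantages(rewards)
```

With equal confidence across the group, the rubric term is what separates the two correct completions, so the better-supported rationale receives the larger advantage.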

2606.14960 2026-06-16 cs.LG cs.CY New submission

Leveraging Physiological Signals to Predict Exam Outcomes with Machine Learning

Lala Yamazaki, Ramchandra Rimal

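Affiliations Middle Tennessee State University

AI summary Uses machine learning models to predict exam grades from physiological data recorded during exams (electrodermal activity, heart rate, skin temperature), comparing logistic regression, random forest, and SVM with LSTM, GRU, and transformer models. Random forest stands out for efficiency and interpretability, and transformers perform comparably to LSTM and GRU.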

Comments 9 figures, and 5 tables

Abstract

This study investigates the application of machine learning models to predict exam outcomes using physiological data collected during examination sessions. Physiological stress indicators, including electrodermal activity, heart rate, and skin temperature, were analyzed to uncover their association with academic performance. A variety of machine learning approaches were employed, ranging from standard models like logistic regression, random forest, and support vector machines to more advanced architectures, including transformers, long short-term memory (LSTM), and gated recurrent unit (GRU) models. This diversity aimed to capture the complex interactions within the data effectively. A key focus was assessing the adaptability of transformers in processing numerical data and evaluating their performance in this novel context. Standard performance metrics, such as accuracy, precision, recall, and F1-score, were used to compare model efficacy. The experimental results demonstrate that while deep learning models generally excel at capturing complex relationships in physiological data, simpler models like random forests can sometimes achieve superior performance while offering computational efficiency and interpretability. Furthermore, transformers demonstrated notable versatility, showcasing performances comparable to those of the LSTM and GRU models. This research underscores the importance of experimenting with a broad class of models that align with the objectives of the problem at hand, balancing precision, efficiency, and interpretability. By elucidating the relationships between physiological signals and academic performance, this study contributes to understanding stressors affecting students' mental health. It further promotes leveraging physiological data to enhance student well-being and academic outcomes.

2606.14956 2026-06-16 cs.LG New submission

A Comparative Study of Graph Neural Network Layer Selection for Interaction Modelling in Driving Trajectory Prediction

George Daoud, Mohamed El-Darieby

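Affiliations Ontario Tech University; Assiut University

AI summary Compares the spatial and temporal processing capabilities of 19 graph neural network layer types for trajectory prediction, finds that ARMA, Chebyshev, and topology-aware layers perform best, and distills design principles: sum-based aggregation, multi-head attention, and distinct weights for different hop distances.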

Comments 6 pages, 1 figure

Journal ref The IEEE Intelligent Vehicles Symposium (IEEE IV 2026)

Abstract

Autonomous driving systems rely on precise trajectory prediction to plan safe and efficient movement. Graph Neural Networks (GNNs) have become a promising approach for modelling spatiotemporal interactions among road agents. However, designing GNN architectures for trajectory prediction remains non-standardized, with little guidance on which graph layers effectively capture spatial interactions and temporal dynamics. This paper offers a detailed comparative study of 19 graph layer types, focusing on their spatial and temporal processing capabilities to discover the most effective architectures for trajectory prediction. Within the explored hyperparameter setting, we highlight five standout layer combinations, with ARMA, Chebyshev, and topology-aware layers consistently performing better than others. Beyond performance metrics, our findings yield practical design principles: sum-based aggregation is more effective than mean-based methods, multi-head attention mechanisms enable richer interactions, and assigning different weights to different hop distances significantly improves prediction accuracy. These findings offer useful guidance for designing more interpretable and effective trajectory prediction models.
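
A small example shows one commonly cited difference between the two aggregators: a mean discards how many neighbors contributed, while a sum keeps it. This is offered as a possible intuition only; the abstract reports the empirical finding without giving a reason, and the feature vectors below are made up.

```python
def aggregate(neighbor_features, mode):
    """Combine neighbor feature vectors element-wise by sum or by mean."""
    dim = len(neighbor_features[0])
    totals = [sum(feat[i] for feat in neighbor_features) for i in range(dim)]
    if mode == "sum":
        return totals
    if mode == "mean":
        return [t / len(neighbor_features) for t in totals]
    raise ValueError(f"unknown aggregation mode: {mode}")


# Hypothetical scenes: an agent with one nearby vehicle versus four identical ones.
sparse_scene = [[1.0, 0.0]]
dense_scene = [[1.0, 0.0]] * 4

mean_sparse = aggregate(sparse_scene, "mean")
mean_dense = aggregate(dense_scene, "mean")
sum_sparse = aggregate(sparse_scene, "sum")
sum_dense = aggregate(dense_scene, "sum")
```

Here the mean produces the same vector for both scenes, whereas the sum separates light traffic from dense traffic.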

2606.14945 2026-06-16 cs.LG New submission

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

Faramarz Jabbarvaziri

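Affiliations University of California, Berkeley

AI summary Proposes a LangGraph-based stateful ReAct agent that uses persistent state and a fixed-size conversation window to cut the per-iteration token cost of autonomous experimentation from $O(n)$ to $O(1)$, and the total cost from $O(n^{2})$ to $O(n)$. Token consumption drops by 90% on hyperparameter tuning and by 52% on code optimization.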

Abstract

The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring $O(n)$ token cost per iteration and $O(n^{2})$ total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90% fewer tokens (2,492 vs. 24,465). On code optimization, the stateful agent consumes 52% fewer tokens (627K vs. 1,275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at $O(n)$ cost per iteration, while the stateful agent operates within a fixed-size conversation window at $O(1)$ cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.
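
The scaling argument can be reproduced with a back-of-the-envelope cost model. The sketch below assumes every observation costs the same number of tokens and that the stateful agent sees at most `window` past observations; the function names, the window size, and the 100-token figure are illustrative assumptions, not the paper's measurements and not LangGraph code.

```python
def stateless_total(n_iters, tokens_per_observation, prompt_tokens=0):
    """Iteration i re-reads all i observations so far: O(n) per step, O(n^2) total."""
    return sum(prompt_tokens + i * tokens_per_observation for i in range(1, n_iters + 1))


def stateful_total(n_iters, tokens_per_observation, window=3, prompt_tokens=0):
    """Iteration i reads at most `window` observations: O(1) per step, O(n) total."""
    return sum(
        prompt_tokens + min(i, window) * tokens_per_observation
        for i in range(1, n_iters + 1)
    )


# Toy setting: 15 iterations, 100 tokens per observation.
stateless_15 = stateless_total(15, 100)
stateful_15 = stateful_total(15, 100, window=3)

# Doubling the run length roughly quadruples one total and doubles the other.
stateless_growth = stateless_total(30, 100) / stateless_15
stateful_growth = stateful_total(30, 100, window=3) / stateful_15
```

Under this model the gap widens with run length, so the saving observed on a given benchmark depends on both the iteration count and the fixed per-call overhead.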

2606.14943 2026-06-16 cs.CL cs.LG New submission

Simplifying the Modeling of Arbitrary Conditionals in Natural Language

Yinhan Lu, Eric Elmoznino, Léo Gagnon, Sarthak Mittal, Tejas Kasetty, Guillaume Lajoie

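Affiliations Mila - Quebec AI Institute; McGill University; Université de Montréal

AI summary Proposes AC-GPT, a simple modification to standard causal Transformers that enables evaluating and sampling arbitrary conditionals (past, future, and mixed contexts) in a single forward pass while keeping the left-to-right ordering and next-token prediction objective, without degrading standard performance.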

Abstract

Causal Transformers model sequences through an autoregressive factorization of the joint distribution, which enables efficient left-to-right decoding and conditional likelihood computation. However, they cannot tractably sample from or evaluate arbitrary conditionals, e.g., a block of text conditioned on past and future tokens. Recent work aims to solve this problem through novel architectures, but these often lead to sub-optimal modeling of such conditionals and degraded generations. We propose Arbitrary Conditionals GPT (AC-GPT), which introduces a simple modification to standard causal Transformers to enable evaluating and sampling from arbitrary conditionals (including past, future, and mixed contexts) within a single forward pass. Unlike prior approaches, our method preserves the standard left-to-right ordering and next-token prediction objective essential for both strong performance and efficient training on natural language. Crucially, this compatibility allows existing LLMs to be fine-tuned for arbitrary conditioning. Our empirical results indicate that our method outperforms baselines on modeling arbitrary conditionals, without degrading standard left-to-right performance.

2606.14941 2026-06-16 cs.AI New submission

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

Shiqiao Zhou, Zipeng Wu, Holger Schöner, Edouard Fouché, IAG Wilson, Shuo Wang

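Affiliations University of California, Berkeley; Max Planck Institute for Intelligent Systems; University of Montreal

AI summary To address the inadequacy of retrieval based solely on time series similarity under non-stationarity, proposes the SERAF framework, which performs dual-modality retrieval (over time series and their self-generated textual descriptions) to jointly exploit historical patterns and improve forecasting performance.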

Comments Accepted to the ICML 2026 Workshop on Forecasting as a New Frontier of Intelligence

Abstract

Time series forecasting models often benefit from historical patterns. Inspired by Retrieval-Augmented Generation (RAG), recent research has explored retrieving relevant historical time series segments to enhance forecasting. However, relying solely on time series similarity is often insufficient for retrieval under non-stationarity. To address this, we propose a multimodal approach: a Semantics-Enhanced Retrieval-Augmented Time Series Forecasting framework, SERAF. Unlike mainstream approaches that depend only on time series similarity, SERAF conducts dual retrieval over the time series and their self-generated textual descriptions. It retrieves two complementary sets of historical patterns and corresponding futures, which are selectively and jointly used to guide future predictions. Experiments across seven real-world datasets demonstrate the effectiveness of SERAF in bridging numerical and semantic views of time series compared with state-of-the-art baselines.
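
The dual retrieval described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the memory layout, the use of cosine similarity for both views, and the names `dual_retrieve`, `series`, `text_vec`, and `future` are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, keys, k):
    """Indices of the k keys most similar to the query (ties: lower index first)."""
    scores = [(cosine(query, key), i) for i, key in enumerate(keys)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


def dual_retrieve(query_series, query_text_vec, memory, k=2):
    """Retrieve futures twice: once by numeric similarity, once by text similarity.

    memory is a list of dicts with hypothetical keys 'series', 'text_vec', 'future'.
    """
    by_series = top_k(query_series, [m["series"] for m in memory], k)
    by_text = top_k(query_text_vec, [m["text_vec"] for m in memory], k)
    return {
        "numeric": [memory[i]["future"] for i in by_series],
        "semantic": [memory[i]["future"] for i in by_text],
    }
```

A forecaster would then decide how to combine the two candidate sets; that selection step is left out of the sketch because the abstract does not specify it.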

2606.14935 2026-06-16 cs.AI New submission

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis

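Affiliations: Royal Holloway, University of London

AI summary: Introduces PrologMCP, a task-agnostic open-source server that exposes Prolog as a stateful tool through the Model Context Protocol, letting LLM agents robustly delegate deductive reasoning via a translate-run-inspect-repair loop; on PARARULE-Plus it matches or exceeds reasoning LLMs.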

Comments Accepted at Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs, 18 July 2026, Lisbon

Abstract

Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs. 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT-4.1). On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.
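
The translate-run-inspect-repair loop named in this abstract can be sketched as a small control loop. This is only a rough illustration: `translate` and `run` are hypothetical callables standing in for the language model and the solver tool, and nothing here reflects the actual PrologMCP tool interface.

```python
def solve_with_repair(problem, translate, run, max_rounds=3):
    """Translate-run-inspect-repair loop over two placeholder callables.

    translate(problem, feedback) -> program text (stands in for the LLM).
    run(program) -> (ok, payload), where payload is an answer on success
    or an error message on failure (stands in for the solver tool).
    """
    feedback = None
    for attempt in range(1, max_rounds + 1):
        program = translate(problem, feedback)  # translate (or repair, given feedback)
        ok, payload = run(program)              # run the program in the solver
        if ok:
            return {"answer": payload, "attempts": attempt}
        feedback = payload                      # inspect: keep the error for the next round
    return {"answer": None, "attempts": max_rounds}
```

Bounding the number of rounds keeps a translation that never succeeds from looping forever; the bound of three is arbitrary.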

2606.14934 2026-06-16 cs.LG cs.AI New submission

Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

Reza T Batley, Andrew Kichline, Sourav Saha

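Affiliations: Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Polytechnic Institute and State University

AI summary: Proposes the Separable Neural Architecture (SNA), which combines neural approximation with tensor decomposition, solves partial differential equations through a variational framework, achieves algebraic scaling on high-dimensional problems, and delivers large speedups in engineering case studies.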

Abstract

This work introduces the Separable Neural Architecture (SNA), a function representational class combining neural approximation with tensor decomposition. The SNA decouples localized coordinate functions (atoms) from global interactions governed by a sparse, low-rank interaction object. This architecture possesses a compact and smooth inductive bias well-suited for solving partial differential equations (PDEs). When viewed as a Galerkin trial space under the variational SNA (VSNA) framework, the formulation satisfies classical variational guarantees under Lax-Milgram: well-posedness, quasi-optimality, convergence, and stability. In high-dimensional spatiotemporal--parametric PDEs, the VSNA mitigates the curse of dimensionality by scaling algebraically rather than exponentially. Exploiting an entirely factorized, tensor-native alternating least squares (ALS) optimization framework reduces this cost to linear in dimension. The VSNA is validated across elliptic, hyperbolic, and parabolic systems, demonstrating close alignment with predicted algebraic and spectral scaling rates. We showcase the SNA as a "solve once, query anywhere" physical world model via two engineering case studies: a 7D parametric manufacturing simulation and an experimental thermal-to-property inversion pipeline for Inconel 718. The VSNA executes a 1,000,000-query Monte Carlo sweep in 102s on a standard laptop CPU, yielding a 150,000x speedup over a full-grid finite element baseline hosted on an NVIDIA A100 GPU. It further enables real-time generative inverse-mode reconstructions under 100ms. These results demonstrate that the SNA serves as a compact mathematical substrate for continuous parameter manifolds to enable real-time inversion, optimization loops, and rapid uncertainty propagation.

2606.14929 2026-06-16 cs.LG cs.AI stat.ML New submission

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

Yan Dai, Negin Golrezaei, Patrick Jaillet

Affiliations: Operations Research Center, MIT; Sloan School of Management, MIT; Department of EECS, MIT

AI summary: Formalizes embedding model routing in recommendation systems as an adversarial contextual linear bandit with low-rank experts and proposes the Hypentropy Policy Gradient algorithm, which attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret.

Abstract

Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions like adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. We first establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret -- where $s$, $M$, and $T$ are, respectively, the intrinsic rank of the experts, the number of models, and the number of rounds -- thus avoiding a curse of dimensionality. Finally, we also provide a computationally efficient and parameter-free implementation of HPG.

2606.14926 2026-06-16 cs.CV New submission

FlexPooling with Simple Auxiliary Classifiers in Deep Networks

Muhammad Ali, Omar Alsuwaidi, Salman Khan

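Affiliations: Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

AI summary: Proposes FlexPooling, an adaptive pooling method that replaces standard pooling with a learned weighted average, and attaches simple auxiliary classifiers, improving accuracy by 1-3% on multiple image classification datasets.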

Journal ref VISAPP 4 (18th), 497-505 2023

Abstract

In computer vision, the basic pipeline of most convolutional neural networks consists of multiple feature extraction layers, where the input signal is downsampled to a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, which is an essential operation in CNNs. Pooling improves robustness against transformations, reduces the number of trainable parameters, increases the receptive field, and lowers computation time. Since pooling is a lossy process but remains important for extracting high-level information from low-level representations, it is important to preserve the most prominent information from previous activations to improve network discriminability. Standard pooling is usually performed using dense pooling methods, such as max pooling or average pooling, or through strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, called FlexPooling, which generalizes average pooling by learning a weighted average over activations jointly with the rest of the network. We further show that attaching Simple Auxiliary Classifiers (SAC) to the CNN improves performance, and we demonstrate the effectiveness of the proposed method compared with standard pooling methods. Experiments on multiple popular image classification datasets show that FlexPooling consistently outperforms baseline networks, achieving approximately 1 to 3 percent improvement in accuracy.
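
To make "a weighted average over activations" concrete, here is a minimal forward-pass sketch under stated assumptions: non-overlapping 2x2 windows, one shared weight vector, and softmax normalization of the weights. These choices and the name `weighted_pool_2x2` are illustrative only; the paper's actual parameterization may differ, and the training step that would learn the weights is omitted.

```python
import math


def softmax(ws):
    """Numerically stable softmax over a list of raw weights."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool_2x2(feature_map, raw_weights):
    """Pool non-overlapping 2x2 windows with a convex combination of activations.

    raw_weights holds 4 values (row-major over the window); in a real network
    they would be trained jointly with the other parameters.
    """
    w = softmax(raw_weights)
    rows, cols = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            out_row.append(sum(wi * xi for wi, xi in zip(w, window)))
        out.append(out_row)
    return out
```

With all raw weights equal, the operation reduces to average pooling; when one weight dominates, it picks out a fixed position in the window.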

2606.14923 2026-06-16 cs.AI cs.CY cs.MA New submission

Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

Yujiao Chen

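Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a behavioral measure of trust based on costly verification and uses a cooperative survival game to study trust formation, breakage, and recovery across six frontier model snapshots. It finds that trust formation reduces verification, that recovery is slower than formation, and that clustered failures prolong suspicion, and it recommends calibration rather than maximal suspicion as the core of governance.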

Abstract

As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.
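
One simple way to turn "reduced verification relative to a memoryless version of the same model" into a number is a relative reduction in verification rate. The formula and the function names below are assumptions chosen for illustration; the abstract does not give the paper's exact definition.

```python
def verification_rate(decisions):
    """Fraction of rounds in which the agent paid to verify (True = verified)."""
    return sum(decisions) / len(decisions)


def trust_discount(memoryless_decisions, agent_decisions):
    """Relative drop in verification versus the memoryless baseline.

    0.0 means no adjustment; 0.75 means the agent verifies 75% less often
    than the baseline. This definition is a hypothetical stand-in.
    """
    base = verification_rate(memoryless_decisions)
    if base == 0:
        return 0.0  # the baseline never verifies, so there is nothing to reduce
    return 1.0 - verification_rate(agent_decisions) / base
```

Under this reading, the 60-85% reductions reported in the abstract would correspond to discounts of roughly 0.60 to 0.85.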

2606.14922 2026-06-16 cs.SD cs.AI cs.CL eess.AS New submission

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

Vinh Dang Quang, Huy Ngo Quang

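Affiliations: Aimesoft JSC

AI summary: For the VLSP 2022 emotional speech synthesis task, this paper integrates a speaker embedding and a prosody bottleneck into FastSpeech 2 to generate emotional speech for a single speaker and to transfer style across speakers.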

Comments 4 pages

Abstract

Over the last few years, the field of speech synthesis has improved dramatically thanks to deep learning. A growing number of deep learning-based TTS systems can now produce voices with high intelligibility and naturalness. Meanwhile, controlling expressiveness remains a major challenge, and generating speech in different styles or manners has recently received considerable attention from the community. This paper presents our solutions to the emotional speech synthesis (ESS) task at VLSP 2022, which requires generating humanlike, natural-sounding speech from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems show promise in generating emotional speech for a single speaker (Sub-task 1) and in transferring speaking styles from another speaker to a target speaker who has only neutral, non-expressive data, while retaining the target speaker's identity (Sub-task 2).

2606.14912 2026-06-16 cs.CV cs.AI New submission

Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

Li Liu, Mingzhu Wang, Zhenjiang Li, Da Chen, Laurent D. Cohen

Affiliations: Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine; Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine; Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences; University Paris Dauphine, PSL Research University, CNRS, UMR 7534, CEREMADE

AI summary: Proposes a mask proposal voting framework that uses adaptive domain construction and a weighted voting mechanism to overcome the dependence of classical minimal path methods on initialization, achieving robust segmentation in complex scenes.

Abstract

Despite great advances, accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations, and complex topological appearance. Minimal path models have shown strong ability in addressing image segmentation tasks. However, the performance of minimal path-based segmentation approaches is heavily influenced by model initialization, which limits their scope of application in practice. In this work, we propose a novel mask proposal voting framework that overcomes this major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. First, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the likelihood that these proposals accurately cover the target region. Second, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries in complex scenarios and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.
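
The idea of a voting score map in which each mask carries its own importance can be sketched as follows. This is a simplified illustration and not the paper's scheme: the normalization by total weight, the 0.5 threshold, and the name `vote_masks` are all assumptions.

```python
def vote_masks(masks, weights, threshold=0.5):
    """Combine binary mask proposals into a weighted voting score map.

    masks: list of equally sized 2D lists of 0/1 values.
    weights: one non-negative importance value per mask (the prior).
    Returns (score_map, final_mask); scores are normalized to [0, 1].
    """
    total = sum(weights)
    rows, cols = len(masks[0]), len(masks[0][0])
    score = [[sum(w * m[r][c] for m, w in zip(masks, weights)) / total
              for c in range(cols)]
             for r in range(rows)]
    final = [[1 if s >= threshold else 0 for s in row] for row in score]
    return score, final
```

Setting all weights equal recovers plain majority voting, which makes the role of the per-mask prior easy to test in isolation.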

2606.14905 2026-06-16 cs.CV New submission

Deep Learning in Seismic Interpretation: Federated Advances in Salt Dome Segmentation

Muhammad Zain Mehdi, Muhammad Zaid, Owais Aleem

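Affiliations: University of Science and Technology of China

AI summary: Proposes FedSaltNet, a federated learning framework that combines a lightweight Small U-Net with a foreground-weighted aggregation strategy for salt-dome segmentation on four non-IID seismic datasets, yielding a 4.0% relative IoU improvement and showing the need for simple architectures in data-constrained federated settings.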

Comments 7 pages, 8 figures

Abstract

Salt-dome delineation is a critical, high-impact task in subsurface geological interpretation, driving decisions in hydrocarbon exploration, reservoir modeling, and drilling safety. While convolutional encoder-decoder architectures have delivered significant improvements in automated salt segmentation, their widespread application is severely limited by data sovereignty concerns, dataset bias, and the scarcity of labeled seismic volumes. This paper introduces FedSaltNet, a Federated Learning (FL) framework explicitly engineered for robust, generalizable, and privacy-preserving salt-dome segmentation. We couple a lightweight Small U-Net backbone, chosen for its efficiency and regularization properties, with a novel Foreground-Weighted (FG-WEIGHTED) aggregation strategy designed to tackle domain-specific class imbalance. Through an extensive comparative study emulating non-IID conditions across four diverse seismic datasets (TGS, SEAM, F3, GBS), we demonstrate two critical findings. First, the FG-WEIGHTED algorithm effectively mitigates data heterogeneity, yielding a 4.0% relative improvement in Intersection over Union (IoU) over the best conventional FL method. Second, the simple U-Net architecture proves essential, outperforming the higher-capacity ResNet-18 U-Net variant by 166% in average IoU, which underscores the necessity of architectural simplicity in data-constrained federated environments. FedSaltNet provides a validated, high-performance solution that establishes the viability of federated deep learning for collaborative, next-generation subsurface interpretation.
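
As a rough sketch of what a foreground-weighted aggregation could look like, the code below averages client parameters with weights proportional to each client's amount of foreground (salt) data. The weighting rule, sample count times foreground fraction, is an assumption for illustration; the abstract does not state how FG-WEIGHTED computes its weights, and the function names are hypothetical.

```python
def aggregate(client_params, client_weights):
    """Weighted average of flat parameter lists, one list per client."""
    total = sum(client_weights)
    n = len(client_params[0])
    return [sum(w * p[i] for p, w in zip(client_params, client_weights)) / total
            for i in range(n)]


def foreground_weights(num_samples, foreground_fractions):
    """Hypothetical rule: weight = sample count x fraction of foreground pixels."""
    return [n * f for n, f in zip(num_samples, foreground_fractions)]
```

Passing the raw sample counts as weights instead gives the usual sample-size-weighted federated average, so the two schemes can be compared directly.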

2606.14900 2026-06-16 cs.LG New submission

GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning

Mary Isabelle Wisell, Nicholas Jacobs, Aayush Manandhar, Salimeh Yasaei Sekeh

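Affiliations: San Diego State University; University of Utah; University of Maine

AI summary: Proposes GRASP, which fuses multi-source knowledge in O(1) memory through sequential processing, parameter-wise gradient alignment, and iterative fine-tuning, reaching 93.5% mean accuracy on three continual learning benchmarks versus 71.7% for ensemble methods.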

Abstract

Multi-source transfer learning faces a fundamental scalability bottleneck: existing approaches require either loading all K source models into memory simultaneously during parameter fusion, requiring O(K) memory, or deploying all models at inference time, making production deployment infeasible. We propose GRASP (Gradient-Aligned Sequential Parameter Transfer), which achieves superior knowledge integration while maintaining O(1) memory consumption through three key innovations: (1) sequential processing that merges one source at a time into an evolving target model, (2) parameter-wise gradient alignment that selectively transfers only parameters whose optimization directions align with the target domain, avoiding negative transfer, and (3) iterative fine-tuning that adapts transferred knowledge before integrating the next source. Extensive experiments across three continual learning benchmarks (Yearbook, CLEAR-10, CLEAR-100) spanning 10 to 108-year temporal distribution shifts and four architectures (1.3M to 25.6M parameters) demonstrate that GRASP achieves 93.5% mean accuracy across all datasets and architectures, compared with 71.7% for ensemble methods, while requiring only constant memory versus K models for standard multi-source fusion. Critically, GRASP's sequential processing does not need to retain previously merged models and scales to arbitrarily many sources without memory growth, making it uniquely suitable for resource-constrained deployment and continually evolving source domains.
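
The three steps listed in this abstract can be sketched on flat parameter lists. The alignment test used here, copying a source parameter only where the source and target gradients share a sign, is an assumption picked for illustration, and `grad_fn` and `fine_tune` are hypothetical placeholders; the paper's actual criterion is not given in the abstract.

```python
def merge_source(target, source, target_grad, source_grad):
    """Copy a source parameter only where the two gradients share a sign."""
    return [s if gt * gs > 0 else t
            for t, s, gt, gs in zip(target, source, target_grad, source_grad)]


def sequential_merge(target, sources, grad_fn, fine_tune):
    """Merge sources one at a time so only one source is held in memory.

    sources: any iterable (for example a generator) yielding parameter lists.
    grad_fn(params) -> gradient of the target-domain loss at params.
    fine_tune(params) -> adapted params; both callables are placeholders.
    """
    for source in sources:
        target = merge_source(target, source, grad_fn(target), grad_fn(source))
        target = fine_tune(target)  # adapt before integrating the next source
    return target
```

Because `sources` can be a generator, each source model can be loaded, merged, and discarded before the next one is read, which is the constant-memory property the abstract emphasizes.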

2606.14898 2026-06-16 cs.LG New submission

α-Fair Insurance Pricing: A Fairness Continuum

Tianhe Zhang, Xiguang Liu, Peng Shi

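Affiliations: Department of Risk and Insurance, Wisconsin School of Business, University of Wisconsin–Madison; Department of Information Systems and Operations Management, Warrington College of Business, University of Florida

AI summary: Proposes the α-FISP framework, which balances actuarial and solidarity fairness through constrained optimization; the parameter α spans a continuum from purely actuarial to purely solidarity-based pricing, with theoretical guarantees and computational tractability.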

Abstract

Fairness in insurance pricing remains a long-standing and deeply debated puzzle. On one hand, insurers, driven by profitability considerations, set premiums that differentiate across individual risks to achieve actuarial fairness. On the other hand, insurance serves a critical societal function by pooling risks across a population, motivating cross-subsidization among groups to promote solidarity fairness. The tension between these two competing notions of fairness makes insurance pricing inherently complex, particularly in modern settings where granular data allow for increasingly fine risk differentiation and regulators face growing pressure to protect vulnerable groups. To address this challenge, we propose an $α$-Fair Individual Solvent Premium ($α$-FISP) framework for insurance pricing that explicitly captures the trade-off between actuarial and solidarity fairness while guaranteeing solvency, a fundamental requirement in insurance operations. We formulate the pricing problem as a constrained optimization task, where actuarially fair premiums are adjusted subject to budget constraints on cross-subsidization within each risk class. This formulation naturally yields a family of solutions parameterized by $α$, tracing a continuum between purely actuarial and purely solidarity-based pricing and enabling decision-makers to select an operating point along this fairness spectrum. We derive theoretical guarantees for the proposed framework. Numerical experiments show that $α$-FISP is computationally tractable and aligns well with U.S. regulatory regimes featuring heterogeneous state-level fairness requirements.
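
A continuum between purely actuarial and purely solidarity-based pricing can be illustrated with the simplest possible stand-in: a convex combination of each policyholder's expected loss and the pooled mean. This is not the paper's constrained optimization; the interpolation rule and the name `alpha_premiums` are assumptions, and the convention that α = 0 is the actuarial end is arbitrary.

```python
def alpha_premiums(expected_losses, alpha):
    """Interpolate between individual and pooled premiums.

    alpha = 0 charges each policyholder their own expected loss (actuarial);
    alpha = 1 charges everyone the pooled mean (solidarity).
    The linear rule is a hypothetical simplification.
    """
    pooled = sum(expected_losses) / len(expected_losses)
    return [(1 - alpha) * mu + alpha * pooled for mu in expected_losses]
```

Total premium income equals total expected loss for every α under this rule, which gives a basic break-even check in the spirit of the solvency requirement, although the paper's solvency guarantee is presumably stronger.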

2606.14892 2026-06-16 cs.AI cs.LG cs.SI stat.ML New submission

Relational Structural Causal Models

Adiba Ejaz, Elias Bareinboim

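Affiliations: Causal Artificial Intelligence Lab, Columbia University

AI summary: Proposes relational structural causal models, which extend structural causal models to settings where objects and relations vary, uses relational causal graphs and symbolic identification criteria to identify causal and observational queries about unseen combinations, and designs relational neural causal models that outperform non-relational baselines on traffic scenes.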

Comments Proceedings of the Forty-Third International Conference on Machine Learning

Abstract

An artificial intelligence must have a model of its environment that is causal, supporting reasoning about interventions and counterfactuals, and also combinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We develop relational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show that, without further assumptions, answers to not only causal but also observational queries about unseen combinations of objects cannot be identified. To enable such identification, including in the presence of unobserved confounding, we define relational causal graphs and derive symbolic identification criteria. Finally, we propose relational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

2606.14886 2026-06-16 cs.CV cs.AI New submission

Improved Knowledge Distillation for Land-Use Image Classification

Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

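Affiliations: Jadavpur University; Rochester Institute of Technology

AI summary: Proposes an improved knowledge distillation framework in which a VGG16 teacher transfers knowledge to a lightweight MobileNetV2 student, combining hard and soft supervision; it reaches 99.04% accuracy across three datasets and outperforms baseline methods.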

Comments Accepted by IGARSS 2026

Abstract

This article proposes an improved Knowledge Distillation (KD) framework for efficiently compressing deep convolutional neural networks for land-use image classification. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and Cosine Similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method improves performance, achieving an accuracy of 99.04% and outperforming both baseline student training and single-loss distillation approaches while retaining substantial model compression.
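
A loss that combines hard supervision with Kullback-Leibler and cosine similarity terms might look like the sketch below for a single example. The temperature, the equal loss weights, and the choice to compute cosine similarity on raw logits are assumptions made for illustration; the abstract does not specify them, and `kd_loss` is a hypothetical name.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0,
            w_hard=1.0, w_kl=1.0, w_cos=1.0):
    """Cross-entropy + KL(teacher || student) + (1 - cosine similarity).

    Assumes both logit vectors are non-zero. All weights are hypothetical.
    """
    hard = -math.log(softmax(student_logits)[label])      # hard supervision
    p_t = softmax(teacher_logits, temperature)            # softened teacher
    p_s = softmax(student_logits, temperature)            # softened student
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    dot = sum(a * b for a, b in zip(student_logits, teacher_logits))
    norm = (math.sqrt(sum(a * a for a in student_logits))
            * math.sqrt(sum(b * b for b in teacher_logits)))
    cos_loss = 1.0 - dot / norm                           # direction mismatch
    return w_hard * hard + w_kl * kl + w_cos * cos_loss
```

When the student's logits equal the teacher's, both soft terms vanish and only the cross-entropy with the true label remains.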

2606.14885 2026-06-16 cs.AI cs.CL New submission

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Yi Lu, Zhuofeng Li, Ping Nie, Haoxiang Zhang, Yuyu Zhang, Kai Zou, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang

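Affiliations: University of Toronto; Texas A&M University; University of Waterloo; UC San Diego; Verdent AI; Netmind AI

AI summary: Proposes the DR-DCI framework, which treats retrieval as an agent-callable action that dynamically expands a local workspace, combining retriever recall with the precision of local DCI operations for efficient search and verification over large corpora.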

Comments 25 pages, 4 figures, 22 tables

Abstract

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

2606.14883 2026-06-16 cs.CV cs.LG New submission

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Salimeh Sekeh, Mary Wisell

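Affiliations: San Diego State University

AI summary: Analyzes cross-modal (vision-language) contributions in continual vision-language models from a theoretical perspective, proposes a new perspective and validates it experimentally, and examines how task order and inter-task similarity affect contribution robustness, along with improved generalization performance.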

Abstract

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

2606.14882 2026-06-16 cs.RO New submission

DynaHMRC: Decentralized Heterogeneous Multi-Robot Collaboration for Dynamic Tasks with Large Language Models

Wenhao Yu, Yu'ang Xie, Yifan Duan, Jie Peng, Guanting Ye, Ka-Veng Yuen, Yanyong Zhang, Jianmin Ji

Affiliations: University of Science and Technology of China (USTC); University of Macau (UM)

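AI summary: Proposes DynaHMRC, a decentralized framework in which each robot acts as a role-aware LLM agent. Dynamic heterogeneous multi-robot collaboration is achieved through a four-stage closed-loop process (self-description, task allocation with leadership bidding, leader election, and reflective execution), and a benchmark is built to validate the framework's efficiency and scalability.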

Abstract

Large language models (LLMs) provide robots with richer task understanding and adaptability, making them promising for coordinating heterogeneous multi-robot systems in long-horizon tasks. Despite this potential, several challenges remain underexplored: (1) Centralized LLM schedulers scale poorly as team size and environmental complexity increase. A single model must process excessive contextual information, and long-context approximation may degrade reasoning quality; (2) Existing task formulations insufficiently consider dynamic settings, while robust adaptation to evolving task conditions is essential for real-world deployment; (3) Domain-specific data scarcity limits specialized robotic reasoning, making proprietary general-purpose models inefficient for expert tasks. To address these limitations, we propose DynaHMRC, a decentralized framework in which each robot acts as a role-aware LLM agent. This design mitigates the single-model context bottleneck and supports flexible collaboration across heterogeneous team configurations. DynaHMRC organizes collaboration as a four-stage closed-loop process: self-description, task allocation with leadership bidding, leader election, and reflective execution, supported by executable robot interfaces. We further develop a benchmark covering three task families, four dynamic variations, and six team configurations to systematically study dynamic task modeling. In addition, we conduct an empirical analysis to guide the construction of domain-specific expert datasets and fine-tune pretrained LLMs to improve specialized competence. Experiments show that DynaHMRC achieves higher success rates than strong baselines with fewer action and communication steps, while demonstrating promising scalability trends as team size grows within the evaluated settings.