2603.08290 2026-05-19 cs.LG cs.AI

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

Chaewon Moon, Dongkuk Si, Chulhee Yun

Affiliations Graduate School of AI, KAIST; Mobilint, Inc.

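AI summary This study examines the implicit bias of sharpness-aware minimization (SAM) when training on linearly separable binary classification. It finds that SAM behaves differently at depth L=2 than at depth L=1 and exhibits a phenomenon called sequential feature amplification.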

Comments Accepted to ICLR 2026, 84 pages, 35 figures

Abstract

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
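
The update rule discussed in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the depth-2 diagonal network is parameterized as `beta = u * v`, uses the logistic loss on a single example, and applies the standard SAM step (gradient evaluated at a perturbed point). The function names, step size `lr`, and radius `rho` are hypothetical choices made for this example.

```python
import numpy as np

def grads(u, v, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * <u*v, x>)) w.r.t. u and v."""
    margin = y * np.dot(u * v, x)
    coeff = -y / (1.0 + np.exp(margin))  # derivative of the loss w.r.t. <u*v, x>
    g_beta = coeff * x                   # gradient w.r.t. the predictor beta = u * v
    return g_beta * v, g_beta * u        # chain rule through the elementwise product

def sam_step(u, v, x, y, lr=0.1, rho=0.05, norm="l2"):
    """One SAM step: perturb the parameters, then descend along the perturbed gradient."""
    gu, gv = grads(u, v, x, y)
    if norm == "l2":
        # l2-SAM: the perturbation is the gradient rescaled to length rho
        scale = np.sqrt(np.sum(gu**2) + np.sum(gv**2)) + 1e-12
        eu, ev = rho * gu / scale, rho * gv / scale
    elif norm == "linf":
        # l_inf-SAM: the perturbation is rho times the sign of the gradient
        eu, ev = rho * np.sign(gu), rho * np.sign(gv)
    else:
        raise ValueError("norm must be 'l2' or 'linf'")
    gu_p, gv_p = grads(u + eu, v + ev, x, y)
    return u - lr * gu_p, v - lr * gv_p

# Single-example dataset with one major and two minor coordinates (assumed values).
x, y = np.array([1.0, 0.5, 0.1]), 1.0
u, v = np.full(3, 0.5), np.full(3, 0.5)
for _ in range(100):
    u, v = sam_step(u, v, x, y, norm="l2")
beta = u * v
direction = beta / np.linalg.norm(beta)  # normalized predictor after 100 steps
```

Tracking `direction` over training, or across initialization scales, is one way to inspect which coordinates the predictor relies on at a given time.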

2603.07900 2026-05-19 cs.AI

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

Affiliations Harvard-MIT HST; Columbia University; Harvard Medical School

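AI summary This paper proposes EveryQuery, an EHR foundation model that achieves zero-shot clinical prediction through task-conditioned pretraining. Rather than generating future events, it directly estimates the likelihood of an outcome occurring in a future window, and it outperforms an autoregressive baseline on multiple prediction tasks.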

Abstract

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

2603.04727 2026-05-19 cs.CV cs.AI

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

Affiliations Electrical and Computer Engineering Department

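AI summary This paper studies the zero-shot anomaly detection performance of multimodal large language models in real-world settings. It finds a conservative bias; class-specific instructions can raise the F1 score, but recall remains the key bottleneck.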

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
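
A short worked example may help show why a conservative bias gives high precision but a low F1 score. The confusion counts below are hypothetical and chosen only for illustration; they are not taken from the paper. The function name is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical conservative detector: it flags only 5 of 100 anomalous clips
# and never raises a false alarm.
p, r, f1 = precision_recall_f1(tp=5, fp=0, fn=95)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a collapsed recall drags the score down no matter how high precision is.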

2603.04161 2026-05-19 cs.CL

Traces of Social Competence in Large Language Models

Tom Kouwenhoven, Michiel van der Meer, Max van Duijn

Affiliations Leiden Institute of Advanced Computer Science; Leiden University

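AI summary This paper studies how large language models perform on the False Belief Test, using Bayesian logistic regression to analyze how model size and post-training affect socio-cognitive competence. Scaling model size helps performance, though not strictly. Explicating propositional attitudes alters response patterns, and further reasoning-oriented fine-tuning amplifies this effect.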

Comments Presented at the 2026 Conference on Computational Natural Language Learning (CoNLL)

Abstract

The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.

2602.22941 2026-05-19 cs.CV

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs

Affiliations Laboratory for Biosignal Processing, Leipzig University of Applied Sciences, Leipzig, Germany; Research Group Canoeing, Institute for Applied Training Science (IAT), Leipzig, Germany; German Canoe Federation, Duisburg, Germany

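AI summary This paper proposes a method for reconstructing the velocity and stroke rate of canoe sprint team boats from panned and zoomed video recordings. It uses YOLOv8 to detect buoys and athletes, estimates homographies from the known buoy grid, estimates boat position via a U-Net-based boat tip calibration, tracks robustly with optical flow, and finally extracts stroke rate information. Experiments give a MAPE of 0.011 for velocity and 0.009 for stroke rate, providing accurate, automated feedback.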

Abstract

Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalize the estimation of the boat position by learning a boat-specific athlete offset using a U-Net-based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity MAPE of 0.011 [0.008 0.014] (Spearman rho = 0.974) and a stroke rate MAPE of 0.009 [0.006 0.013] (Spearman rho = 0.975). The methods provide coaches with highly accurate, automated feedback that requires no sensors and only minimal manual initialization.
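
The two evaluation metrics named in the abstract can be computed as in the sketch below. It assumes MAPE is reported as a fraction (as the values 0.011 and 0.009 suggest) and that the compared series contain no tied values, so ranks can be taken with a double `argsort`. The sample velocities are hypothetical, and the function names are illustrative.

```python
import numpy as np

def mape(reference, estimate):
    """Mean absolute percentage error, returned as a fraction of the reference."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.mean(np.abs(estimate - reference) / np.abs(reference)))

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical boat velocities in m/s: GPS reference vs. video-based estimate.
gps = [4.0, 5.0, 5.5, 5.2]
video = [4.1, 4.9, 5.6, 5.1]
print(f"MAPE={mape(gps, video):.4f}  Spearman rho={spearman_rho(gps, video):.3f}")
```

For data with ties, a library routine that assigns average ranks would be the safer choice.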

2602.18217 2026-05-19 cs.CL

Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox

Affiliations Department of Linguistics, Georgetown University, USA; National Institute for Japanese Language and Linguistics, Japan

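AI summary This paper proposes an information-theoretic measure of storage cost that quantifies how much contextual information must be maintained during sentence comprehension. The cost is estimated with neural language models and validated in English: it recovers processing asymmetries in center embeddings and relative clauses and predicts reading-time variance.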

Comments Accepted to CoNLL 2026

Abstract

Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.

2602.12703 2026-05-19 cs.LG

Neural Networks With Dense Weights Are Not Universal Approximators

Levi Rauchwerger, Stefanie Jegelka, Ron Levie

Affiliations Princeton University, Dept of CS; MIT, Dept of EECS and CSAIL; TUM, School of CIT, MCML, MDSI; Technion – IIT, Faculty of Mathematics

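AI summary This study examines the approximation power of dense neural networks. It shows that, under constraints on the weights, densely connected networks cannot approximate arbitrary continuous functions, revealing an inherent limitation of dense layers and motivating sparse connectivity as necessary for true universality.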

2602.06866 2026-05-19 cs.LG

T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh

Affiliations Transport and Planning, Delft University of Technology

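AI summary This paper proposes the T-STAR framework, which uses a two-stage structure to separate consistent demand patterns from short-term fluctuations and improve short-term probabilistic demand forecasting. Experiments show that it outperforms existing methods in both deterministic and probabilistic accuracy and is robust across space and time.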

Comments This work has been submitted to Transportation Research Part C

Abstract

Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.'s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR's ability to transfer to previously unseen service areas without retraining. These results underscore the framework's potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.

2602.05156 2026-05-19 cs.RO cs.SY eess.SY

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

Dong Ho Kang, Aaron Kim, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

Affiliations The University of Texas at Austin; Sony Group Corporation

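AI summary This paper presents the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, an embedded distal phalanx, and compliant pulp to shape contact behavior. A strain-energy-based bending-indentation model guides the fingertip design and explains how material stiffness and contact geometry govern deformation partitioning. Experiments show improved pinch stability, fingernail-mediated dorsal-contact force transmission, and proprioceptive observability, along with successful edge-sensitive manipulation tasks such as paper singulation, card picking, and orange peeling. The results indicate that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism offers a principled approach to precise manipulation.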

Abstract

We present the PLATO Hand, a dexterous robotic hand with a hybrid fingertip that combines a rigid fingernail, an embedded distal phalanx, and compliant pulp to shape contact behavior during manipulation. By mechanically organizing how contact is initiated, supported, and transmitted at the fingertip, this structure creates stable and task-relevant contact conditions across diverse object geometries and grasp orientations. We develop a strain-energy-based bending--indentation model to guide the fingertip design and to explain how material stiffness and contact geometry govern deformation partitioning within the fingertip. Experiments show improved pinch stability, improved fingernail-mediated dorsal-contact force transmission and proprioceptive observability, and successful execution of edge-sensitive manipulation tasks, including paper singulation, card picking, and orange peeling. These results show that coupling a mechanically structured contact interface with a force-motion-transparent finger mechanism provides a principled approach to precise manipulation. Our project page is at: https://platohand.github.io

2602.03797 2026-05-19 cs.LG

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

Bethan Evans, Jared Tanner

Affiliations Department of Mathematics, University of Oxford, Oxford, UK

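AI summary This paper derives the minimum-norm weight perturbation a deep network needs to achieve a specified output change and discusses what determines its magnitude. It applies the result to precision-modification-activated backdoor attacks, identifies compression thresholds for attack success, and shows that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy.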

2601.14330 2026-05-19 cs.CV cs.LG

YawDD+: Frame-level Annotations for Accurate Yawn Prediction

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

Affiliations Embedded Systems Division, Silicon Austria Labs; Institute of Visual Computing, Graz University of Technology; Department of Computer Science, University of Innsbruck

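AI summary This paper proposes a semi-automated labeling pipeline with human-in-the-loop verification that annotates YawDD videos with more accurate frame-level labels, improving model training on edge devices for more efficient driver fatigue detection.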

Comments Accepted at the 33rd IEEE International Conference on Image Processing (ICIP) 2026

Abstract

Driver fatigue remains a leading cause of road accidents, responsible for 24% of crashes. While yawning serves as an early behavioral indicator of fatigue, existing approaches face significant challenges due to systematic noise in video-annotated datasets arising from coarse temporal annotations. Training robust machine learning (ML) models requires rich supervisory labels that help learn salient features from the training data. Moreover, efficient on-device training and inference are crucial in driver fatigue detection tasks to enable accurate real-time decisions on vehicles without reliance on cloud infrastructure. To address this issue, we develop a semi-automated labeling pipeline with human-in-the-loop verification to annotate YawDD videos with YawDD+ frame-level annotations, enabling more accurate model training on edge platforms such as NVIDIA Jetson NANO. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP on Jetson NANO and AGX. Moreover, MNasNet trains in just 8.69 min/epoch while delivering up to 115 frames-per-second (FPS) inference throughput on AGX, confirming that enhanced data quality alone supports on-device driver fatigue monitoring systems without server-side computation. The YawDD+ dataset and trained models are available online.

DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics

YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search

Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

Traces of Social Competence in Large Language Models

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

Information-Theoretic Storage Cost in Sentence Comprehension

SWING: Unlocking Implicit Graph Representations for Graph Random Features

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Neural Networks With Dense Weights Are Not Universal Approximators

T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

PLATO Hand: Shaping Contact Behavior with Fingernails for Precise Manipulation

Manifold Random Features

Mitigating Conversational Inertia in Multi-Turn Agents

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks

LURE: Latent Space Unblocking for Multi-Concept Reawakening in Diffusion Models

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Parallelizable memory recurrent units

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Symmetry Matters: Auditing and Symmetrizing 3D Generative Models

Learning under Distributional Drift: Prequential Reproducibility as an Intrinsic Statistical Resource

YawDD+: Frame-level Annotations for Accurate Yawn Prediction