arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21299 2026-05-21 cs.CL cs.AI

Tracing the ongoing emergence of human-like reasoning in Large Language Models

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

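Affiliations: Departament de Filologia Catalana, Universitat Autònoma de Barcelona; Institut für Psychologie, Humboldt-Universität zu Berlin; Institució Catalana de Recerca i Estudis Avançats (ICREA)

AI summary: The study examines whether large language models reason in a human-like way on conditional inference tasks. Humans enrich logical reasoning with pragmatic inferences, whereas model behavior is less stable: some models follow conditional semantics but ignore pragmatic inferences. Overall, LLMs are semantically accurate but lack the pragmatic richness of human reasoning.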

Abstract

Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twenty-five LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth table of conditionals but ignore pragmatic inferences, while others deviate from the truth table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.
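
The contrast between the literal truth table and the pragmatically enriched reading can be made concrete with a small sketch. It assumes the literal reading is the material conditional and models the "pays only if the lawn is mowed" reading as a biconditional (often called conditional perfection, a term the abstract itself does not use). The function names are illustrative and the paper's actual test items and scoring are not reproduced here.

```python
from itertools import product

def material_conditional(p: bool, q: bool) -> bool:
    """Literal 'if p then q': false only when p holds and q fails."""
    return (not p) or q

def perfected_conditional(p: bool, q: bool) -> bool:
    """Assumed enriched reading 'q if and only if p': p and q stand or fall together."""
    return p == q

# Enumerate all four truth-value combinations for (p, q).
rows = [(p, q, material_conditional(p, q), perfected_conditional(p, q))
        for p, q in product([True, False], repeat=2)]

for p, q, literal, enriched in rows:
    print(f"p={p!s:5} q={q!s:5} literal={literal!s:5} enriched={enriched}")

# Cases where the two readings disagree.
differing = [(p, q) for p, q, literal, enriched in rows if literal != enriched]
```

Under these assumptions the two readings disagree only when the antecedent is false and the consequent is true (the lawn is not mowed, yet the money is paid), which is the kind of case where a strict truth-table follower and a pragmatic reasoner would be expected to part ways.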

2605.21295 2026-05-21 cs.LG cs.AI cs.HC

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu

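Affiliations: Columbia University; University of North Carolina at Chapel Hill; Yale University; University of California, Los Angeles; Google; University of Virginia

AI summary: This paper proposes TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck: raw signals are first abstracted into high-level natural language, from which behavioral outcomes are then predicted. The method achieves state-of-the-art cross-cohort generalization in mental-health prediction.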

Abstract

Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1–10.1% and 9.5–44.1% for anxiety, and 3.2–9.6% and 27.4–57.6% for depression (all p < 0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.
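
To illustrate how a verifiable reward can drive GRPO-style training without intermediate annotations, here is a minimal sketch. It assumes the reward is the negative absolute error of a predicted score and that advantages are standardized within a group of sampled outputs for the same input. The reward definition, the function names, and the numbers are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def verifiable_reward(predicted_score: float, true_score: float) -> float:
    """Assumed reward: closer predictions score higher (negative absolute error)."""
    return -abs(predicted_score - true_score)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four sampled predictions for one participant-week.
true_score = 10.0
sampled_predictions = [9.0, 14.0, 10.5, 4.0]
rewards = [verifiable_reward(p, true_score) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch only the final prediction is checked against ground truth, so the intermediate natural-language abstraction is rewarded indirectly, through the quality of the prediction it leads to.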

2605.21288 2026-05-21 cs.LG

A Mechanistic Study of Tabular Foundation Models

Marin Biloš, James T. Wilson, Anderson Schneider, Yuriy Nevmyvaka

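Affiliations: Morgan Stanley

AI summary: This paper studies why tabular foundation models with different architectures converge in accuracy on classification and regression tasks. It gives a mechanistic account of the models' internal algorithms, the sources of their symmetries, and their robustness to perturbations, and finds that the representation collapse noted in prior work is not a practical concern.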

Abstract

Tabular foundation models with different architectures converge in accuracy across a range of classification and regression tasks. This raises questions a leaderboard cannot answer: (i) whether the models execute the same in-context algorithm, (ii) where row, column, and class-permutation invariances originate, and (iii) how robust they are under perturbations engineered against the inferred mechanism. We characterize all three. The model families realize qualitatively distinct similarity-based readouts: from an attention-weighted vote over context labels to a class-conditional mean readout, each confirmed by causal intervention. We find that the representation collapse highlighted in prior work is not a practical concern for them. Each model's permutation invariances trace to specific positional parameters whose removal preserves accuracy and makes approximate invariance exact. Perturbations engineered against each readout reproduce predicted failure modes; hub and rank attacks isolate them from refit baselines. Together these results give a mechanistic account of contemporary tabular foundation models and identify which inductive biases govern both their accuracy and characteristic failures.
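
The two readouts named in the abstract can be contrasted with a toy sketch. It assumes dot-product similarity on raw feature vectors, which is a simplification: the real models operate on learned representations, and all names and numbers below are illustrative only.

```python
import math

def attention_vote(query, context_x, context_y, n_classes, temperature=1.0):
    """Softmax-attention weights over context rows, accumulated per context label."""
    scores = [sum(q * x for q, x in zip(query, row)) / temperature for row in context_x]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(weights)
    votes = [0.0] * n_classes
    for w, y in zip(weights, context_y):
        votes[y] += w / total
    return votes

def class_mean_readout(query, context_x, context_y, n_classes):
    """Score each class by the similarity between the query and that class's mean row."""
    dim = len(query)
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for row, y in zip(context_x, context_y):
        counts[y] += 1
        for j, value in enumerate(row):
            sums[y][j] += value
    scores = []
    for c in range(n_classes):
        class_mean = [s / max(counts[c], 1) for s in sums[c]]
        scores.append(sum(q * m for q, m in zip(query, class_mean)))
    return scores

context_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
context_y = [0, 0, 1]
query = [1.0, 0.2]
votes = attention_vote(query, context_x, context_y, n_classes=2)
means = class_mean_readout(query, context_x, context_y, n_classes=2)
```

Both toy readouts depend only on the set of context rows and not on their order, so they are exactly row-permutation invariant here. The abstract reports that the real models become exactly invariant only once specific positional parameters are removed.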

2605.21280 2026-05-21 cs.CV

Let EEG Models Learn EEG

Yifan Wang, Yijia Ma, Wen Li, Chenyu You

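Affiliations: Stony Brook University; University of Texas Health Center at Houston

AI summary: This paper proposes JET, a generative framework based on conditional flow matching that produces high-quality EEG signals by directly modeling the continuous evolution of neural signals. It addresses the shortcomings of conventional discrete denoising methods in capturing long-range temporal dependencies and preserving spectral structure, and outperforms existing methods on multiple benchmarks.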

Comments: Accepted by ICML 2026

Abstract

High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/
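
As a rough illustration of the conditional flow matching objective mentioned above, the sketch below builds one training pair. It assumes the common straight-line path between a noise sample and a data sample, for which the regression target is the constant velocity `data - noise`. The abstract does not state which path JET uses, and the toy window and function names are hypothetical.

```python
import random

def cfm_training_pair(noise, data, t):
    """Point on the assumed straight-line path at time t, plus its target velocity."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity

def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
eeg_window = [0.0, 0.5, 1.0, 0.5, 0.0]             # toy stand-in for a raw EEG segment
noise = [rng.gauss(0.0, 1.0) for _ in eeg_window]  # sample from the noise distribution
t = rng.random()                                   # time drawn uniformly from [0, 1)

x_t, v_target = cfm_training_pair(noise, eeg_window, t)

# A trained vector field would be asked to output v_target at (x_t, t);
# here we only score a trivial all-zero prediction against that target.
loss_for_zero_field = mse([0.0] * len(x_t), v_target)
```

With this path, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the data sample, which is the sense in which the learned field "transports noise to the data distribution".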

2605.21272 2026-05-21 cs.CV cs.AI

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

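Affiliations: Jasper Research

AI summary: This paper presents the MONET dataset, which provides high-quality text-to-image data through multi-stage filtering and enrichment, lowering the barrier to large-scale reproducible research.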

Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approximately 104.9M image-text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

2605.21266 2026-05-21 cs.LG cs.AI

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Richa Verma, Balaraman Ravindran

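Affiliations: TCS Research Department of CSE; IIT Madras; Department of Data Science & AI; Wadhwani School of Data Science & AI

AI summary: This paper proposes G2D, which combines a short GRPO warm-up, construction of a static preference dataset, and offline DPO fine-tuning to outperform GRPO at lower compute cost, and emphasizes that the informativeness of preference data matters more than its quantity.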

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO), a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8 percentage points at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much with K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.
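
The second and third stages of such a pipeline can be sketched as follows. The pairing rule (every verified-correct rollout paired with every incorrect one for the same prompt) is an assumption for illustration, since the abstract does not describe how pairs are built. The loss is the standard DPO objective, written for scalar sequence log-probabilities.

```python
import math

def build_preference_pairs(rollouts):
    """rollouts: list of (response, is_correct) for one prompt.

    Assumed rule: pair each verified-correct response (chosen)
    with each incorrect response (rejected).
    """
    correct = [resp for resp, ok in rollouts if ok]
    wrong = [resp for resp, ok in rollouts if not ok]
    return [(c, w) for c in correct for w in wrong]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))

# Hypothetical rollouts from the warmed-up policy for a single math prompt.
pairs = build_preference_pairs([("a", True), ("b", False), ("c", False)])
no_contrast = build_preference_pairs([("a", True), ("b", True)])
```

Under this assumed rule, a prompt whose rollouts are all correct or all incorrect contributes no pairs, which gives one intuition for why rollouts from a policy with calibrated uncertainty might carry more contrastive signal than rollouts from an overconfident one.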

2605.19269 2026-05-21 cs.LG

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao

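Affiliations: Massachusetts Institute of Technology; Princeton University; Together AI; Meta

AI summary: This paper proposes CODA, a GPU kernel abstraction that rewrites the non-attention computation in Transformers as GEMM-epilogue programs to improve training efficiency and hardware utilization.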

Abstract

Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
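
The GEMM-plus-epilogue idea can be illustrated on the CPU with a toy tiled matrix multiply. This is only a conceptual sketch in plain Python: `gemm_with_epilogue` and `bias_relu` are hypothetical names, not CODA's interface, and a real kernel would keep the accumulator tile in registers or shared memory rather than in a Python list.

```python
def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled matmul that applies `epilogue` to each output tile before storing it."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Fixed "mainloop": accumulate one output tile (stands in for on-chip storage).
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols] for i in rows]
            # Epilogue: transform the tile, then write each element to `out` exactly once.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    out[i][j] = epilogue(acc[ti][tj], i, j)
    return out

bias = [0.5, -10.0, 0.0]

def bias_relu(value, row, col):
    """Example epilogue: add a per-column bias, then apply ReLU."""
    return max(value + bias[col], 0.0)

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
fused = gemm_with_epilogue(a, b, bias_relu)
```

An unfused version would write the raw product to memory, read it back to add the bias, and read it again for the activation. Folding both steps into the epilogue removes those extra round trips, which is the data-movement saving the abstract describes.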

2605.17618 2026-05-21 cs.AI

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

Yadhu Kartha, Conor Anderson, Jenny Foster, Theresa Hamlin, Johanna Lantz, Ryan Lay, Juergen Hahn, Gari D. Clifford, Hyeokhyen Kwon

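Affiliations: Georgia Institute of Technology; The Center For Discovery; Rensselaer Polytechnic Institute; Georgia Institute of Technology, Emory University

AI summary: Using wearable sensors and machine learning, this study predicts challenging behaviors of individuals with profound autism in a real classroom setting. It shows that such behaviors can be predicted up to 10 minutes in advance in an educational environment, with an AUC-ROC of 0.78.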

Abstract

Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism, who often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms.

2605.15156 2026-05-21 cs.CL cs.AI cs.LG

MeMo: Memory as a Model

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

Affiliations: Institute of Data Science, National University of Singapore; Integrative Sciences and Engineering Programme, NUSGS; Agency for Science, Technology and Research (A*STAR); Department of Computer Science, National University of Singapore; University of Tokyo; Liquid AI; CSAIL, Massachusetts Institute of Technology; AI Singapore; Singapore-MIT Alliance for Research and Technology Centre, Singapore

AI summary: This paper proposes MeMo, a framework that encodes new knowledge into a dedicated memory model without changing the LLM's parameters, addressing applications that need timely, domain-specific information. Its advantages include handling complex cross-document relationships, robustness to retrieval noise, avoiding catastrophic forgetting, requiring no access to LLM weights or output logits, and a retrieval cost that is independent of corpus size.

Comments: MeMo augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

Abstract

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

2605.10933 2026-05-21 cs.LG cs.CL

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu

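Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China

AI summary: This paper proposes DECO, a sparse Mixture-of-Experts model that matches the performance of dense Transformers under the same total parameter budget and number of training tokens, and that can be deployed efficiently on end-side devices.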

Comments: 15 pages, 10 figures, 12 tables

Abstract

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93× speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.
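
A minimal sketch of the two ingredients named in the abstract follows. The router applies ReLU to a linear score per expert, so experts with non-positive scores receive a gate of exactly zero and can be skipped. For NormSiLU the abstract only says that inputs are normalized before SiLU, so the RMS normalization used here is an assumption, as are all names and weights.

```python
import math

def relu_router(token, router_weights):
    """One gate per routed expert: ReLU of a linear score, so many gates are exactly 0."""
    return [max(sum(t * w for t, w in zip(token, expert_w)), 0.0)
            for expert_w in router_weights]

def norm_silu(xs, eps=1e-6):
    """Assumed form of NormSiLU: RMS-normalize the inputs, then apply SiLU(z) = z * sigmoid(z)."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [(x / rms) / (1.0 + math.exp(-x / rms)) for x in xs]

# Hypothetical token and router weights (one weight vector per routed expert).
token = [1.0, -2.0]
router_weights = [[0.5, 0.1], [-1.0, 0.2], [0.3, 0.4], [2.0, 0.5]]

gates = relu_router(token, router_weights)
active_experts = [i for i, g in enumerate(gates) if g > 0.0]
```

Unlike top-k routing, the number of active experts here is not fixed in advance: it depends on how many scores are positive for a given token, which is why the abstract speaks of an activation ratio rather than a constant k.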

2605.08352 2026-05-21 cs.LG math.PR stat.ML

Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

Konstantin Riedl, Konstantinos Spiliopoulos, Justin Sirignano

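Affiliations: Mathematical Institute, University of Oxford; Department of Mathematics & Statistics, Boston University

AI summary: This paper studies the convergence of the regularized Newton method for training neural networks in the overparameterized limit. By analyzing the Newton neural tangent kernel (NNTK), it proves that in the infinite-width limit the network converges exponentially fast to the target data, and it addresses the spectral bias problem.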

Abstract

A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a "Newton neural tangent kernel" (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.
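
The intuition behind uniform convergence can be shown on a toy problem, with the caveat that this is a finite-dimensional analogy and not the paper's kernel analysis. The sketch minimizes a diagonal quadratic with one high-curvature and one low-curvature direction, the latter standing in for a component with a near-zero eigenvalue. The update form (gradient divided by curvature plus a regularizer `lam`) is an assumed Levenberg-Marquardt-style step, and all constants are illustrative.

```python
def gd_step(theta, curvatures, lr):
    """Gradient descent on 0.5 * sum(c_i * theta_i**2); the gradient is c_i * theta_i."""
    return [t - lr * c * t for t, c in zip(theta, curvatures)]

def regularized_newton_step(theta, curvatures, lam):
    """Assumed regularized Newton step: gradient / (Hessian diagonal + lam)."""
    return [t - (c * t) / (c + lam) for t, c in zip(theta, curvatures)]

curvatures = [1.0, 1e-3]   # second direction: eigenvalue close to zero
theta_gd = [1.0, 1.0]
theta_newton = [1.0, 1.0]

for _ in range(10):
    theta_gd = gd_step(theta_gd, curvatures, lr=1.0)
    theta_newton = regularized_newton_step(theta_newton, curvatures, lam=1e-4)
```

After ten steps gradient descent has barely moved along the low-curvature direction, because each step shrinks that coordinate by a factor of only 0.999. The regularized Newton step shrinks each coordinate by `lam / (c + lam)`, which stays well below one in both directions as long as `lam` is small relative to the curvatures.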

2605.07892 2026-05-21 cs.LG

Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers

Ahmad Aloradi, Tim Roith, Emanuël A. P. Habets, Daniel Tenbrinck

Affiliations: Department of Data Science, FAU Erlangen-Nürnberg; International Audiolabs, FAU Erlangen-Nürnberg; School of Computation, Information and Technology, Technical University of Munich; Munich Center for Machine Learning (MCML)

AI summary: This paper proposes an adaptive regularization method for controlling sparsity more precisely in Bregman-based optimizers by dynamically adjusting the regularization parameter λ, improving the efficiency and accuracy of sparsity control.

Comments: 21 pages, 15 figures

Abstract

Sparse training reduces the memory and computational costs of deep neural networks. However, sparse optimization methods, e.g., those adding an ℓ1 penalty, often control sparsity only indirectly through a regularization parameter λ, whose mapping to the final sparsity rate is non-trivial. In our experiments, we found this parameter sensitivity to be particularly pronounced for Bregman-based optimizers. Specifically, the two variants LinBreg and AdaBreg reach the same sparsity at λ values that differ by up to two orders of magnitude, requiring expensive trial-and-error sweeps to achieve a user-specified sparsity. To address this, we propose an adaptive regularization scheme that updates λ based on the difference between the model's current sparsity and the target sparsity. We analyze the resulting algorithm and evaluate it on automatic speaker verification with ECAPA-TDNN and ResNet34 on VoxCeleb and CNCeleb. The proposed method reliably achieves sparsity targets ranging between 75% and 99%. It also converges faster than the oracle-tuned non-adaptive baseline during early training and matches or surpasses its final performance in equal error rate. We further show that the adaptive scheme inherits key properties from its non-adaptive counterpart, including improved out-of-distribution robustness over the dense baselines.
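
The feedback idea (raise λ when the model is too dense, lower it when it is too sparse) can be sketched with a toy controller. The multiplicative update rule, the step size, and the stand-in for the optimizer (a weight counts as pruned when its magnitude is at most λ) are all assumptions for illustration. The paper's actual update and its interaction with LinBreg or AdaBreg are not given in the abstract.

```python
def thresholded_sparsity(magnitudes, lam):
    """Toy stand-in for the optimizer: weights with magnitude <= lam are treated as zero."""
    return sum(1 for m in magnitudes if m <= lam) / len(magnitudes)

def update_lambda(lam, current_sparsity, target_sparsity, step_size=0.5, lam_min=1e-8):
    """Assumed rule: scale lam up when too dense, down when too sparse."""
    factor = 1.0 + step_size * (target_sparsity - current_sparsity)
    return max(lam * factor, lam_min)

# Hypothetical weight magnitudes spread evenly over (0, 1].
magnitudes = [i / 100 for i in range(1, 101)]
lam, target = 0.1, 0.9

for _ in range(200):
    current = thresholded_sparsity(magnitudes, lam)
    lam = update_lambda(lam, current, target)

final_sparsity = thresholded_sparsity(magnitudes, lam)
```

In this toy setting the controller settles close to the 90% target without any manual sweep over λ, which mirrors the motivation in the abstract. How quickly and how stably a real optimizer responds to changes in λ is exactly what the paper analyzes and is not captured here.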

2605.07816 2026-05-21 cs.CV

ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles

Thomas Gorges, Janne van der Loop, Lukas Hüttner, Linda-Sophie Schneider, Fei Wu, Mathias Seuret, Vincent Christlein

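Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Buchwissenschaft, Johannes Gutenberg-Universität Mainz

AI summary: This paper presents the CircleID competition, which investigates how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. Its two tasks, open-set writer identification and cross-writer pen classification, evaluate how well models recognize known writers while rejecting unknown ones and how well they classify pens across seen and unseen writers.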

Abstract

This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 44 known and 22 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
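
One simple way to turn a closed-set classifier into an open-set one is to reject low-confidence predictions, sketched below. The confidence threshold, the use of `-1` as the "unknown writer" label, and the function names are assumptions for illustration. The abstract does not say how the competition encodes rejections or how the baseline or winning entries handle unknown writers.

```python
def open_set_predict(class_probs, known_writers, reject_threshold=0.5, unknown_label=-1):
    """Return the most likely known writer, or `unknown_label` if confidence is too low."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    if class_probs[best] < reject_threshold:
        return unknown_label
    return known_writers[best]

def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted label matches the ground truth exactly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical writer IDs and softmax outputs for three circle images.
known_writers = [11, 12, 13]
probs = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1]]
predictions = [open_set_predict(p, known_writers) for p in probs]
```

With unknown writers labeled `-1` in the ground truth, a correct rejection counts as a hit in `top1_accuracy`, so both failing to reject an unknown writer and wrongly rejecting a known one lower the score.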

2605.07560 2026-05-21 cs.RO

How to Utilize Failure Demo Data?: Effective Data Selection for Imitation Learning Using Distribution Differences in Attention Mechanism

Kana Miyamoto, Kanata Suzuki, Tetsuya Ogata

发表机构 * Faculty of Science and Engineering, Waseda University, Tokyo, Japan(科学与工程学部,早稻田大学,东京,日本) Physical AI Laboratory, Fujitsu Limited, Kanagawa, Japan(物理人工智能实验室, Fujitsu 有限,神奈川,日本)

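AI summary This paper proposes a method for selecting imitation-learning data using distribution differences in the attention mechanism. It learns latent representations of success-failure discrepancies to improve action stability, and introduces a post-training metric for selecting failure samples that benefit learning.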

Comments 15 pages, 6 figures, 2 tables

详情
AI中文摘要

机器人任务的模仿学习主要依赖于仅在成功演示上训练的策略,尽管在人类数据收集过程中失败是不可避免的。许多现有方法利用失败数据需要额外的数据处理或通过自主回放进行迭代策略更新,这使得难以直接稳定地利用收集期间积累的失败数据。在本工作中,我们提出了一种方法,学习成功-失败差异的潜在表示,并将其纳入注意力机制中。在推理过程中,从初始观察中选择适当的潜在模式以提高行动稳定性。此外,我们引入了一个后训练度量,用于量化每个失败样本与成功演示之间的注意力差异,以选择失败数据。模拟结果表明,当使用失败数据训练时,所提出的方法提高了任务成功率,并且所提出的度量在结合成功演示时能够识别出有益于学习的失败样本。这些结果表明,所提出的方法可以支持在机器人数据收集管道中更高效地利用收集到的演示数据。

英文摘要

Imitation learning for robotic tasks has relied primarily on policies trained only on successful demonstrations, although failures are unavoidable during human data collection. Many existing approaches for exploiting failure data require additional data processing or iterative policy updates through autonomous rollouts, making it difficult to directly and stably utilize failure data accumulated during data collection. In this work, we propose a method that learns latent representations of success-failure discrepancies and incorporates them into the attention mechanism. During inference, an appropriate latent mode is selected from the initial observation to improve action stability. Furthermore, we introduce a post-training metric that quantifies the attention discrepancy between each failure sample and successful demonstrations to select failure data. Simulation results show that the proposed method improves task success rates when trained with failure data and that the proposed metric identifies failure samples that are beneficial for learning when combined with successful demonstrations. These results suggest that the proposed method can support more efficient use of collected demonstrations in robotic data collection pipelines.

2604.27195 2026-05-21 cs.AI

Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

评估TabPFN在数据有限环境下对轻度认知障碍向阿尔茨海默病转化的预测能力

Brad Ye, Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

Affiliations * 1 Department of Medicine, University of Central Florida College of Medicine, Orlando, FL, USA; 2 Department of Mechanical, Industrial & Manufacturing Eng., The University of Toledo, Toledo, OH, USA; 3 Department of Industrial Eng. Mngt. Systems, University of Central Florida, Orlando, FL, USA; 4 Department of Industrial Engineering, Istanbul Technical University, Istanbul, Turkey; 5 School of Data, Mathematical Statistical Sciences, University of Central Florida, Orlando, FL, USA

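AI summary This paper evaluates TabPFN for predicting conversion from mild cognitive impairment to Alzheimer's disease in data-limited settings. Compared with traditional machine learning methods, TabPFN performs better in the low-data regime and maintains a strong AUC even with only 50 training samples.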

Comments 6 pages, 3 figures

详情
AI中文摘要

准确预测轻度认知障碍(MCI)向阿尔茨海默病(AD)的转化对于早期干预至关重要,然而由于纵向数据有限,开发可靠的预测模型具有挑战性。我们评估了TabPFN(表格预训练基础网络)在使用TADPOLE数据集(源自ADNI)预测3年MCI至AD转化的性能,该数据集包含来自人口统计学、APOE4、MRI体积、CSF标记物和PET成像的多模态生物标志物特征。我们进行了不同训练集大小(N=50到1000)和模型(包括XGBoost、随机森林、LightGBM和逻辑回归)的实验比较。TabPFN在AUC=0.892时表现最佳,优于LightGBM(AUC=0.860),并在低数据情况下具有优势。在N=50训练样本时,TabPFN保持了强AUC,而传统机器学习模型在小样本时表现不佳。这些发现表明,基础模型在数据有限的情况下对疾病预测具有前景,如阿尔茨海默病。

英文摘要

Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is essential for early intervention; however, reliable conversion prediction models are difficult to develop due to limited longitudinal data availability. We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3-year MCI-to-AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one of the highest performances (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low-data settings. At N=50 training samples, TabPFN maintained a strong AUC while the traditional machine learning models struggled. These findings demonstrate that foundation models are promising for disease prediction in data-limited scenarios such as Alzheimer's disease.
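The comparison above is reported in terms of AUC. As a reminder of what that number measures, here is a minimal pure-Python sketch that computes ROC AUC as the probability that a randomly chosen positive case (a converter) is scored above a randomly chosen negative case. The function name and the toy inputs are illustrative assumptions, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 (1 = converted to AD in this illustration)
    scores: predicted probabilities or scores, same length as labels
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the gap between 0.892 and 0.860 is a difference in how often converters are ranked above non-converters.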

2603.24472 2026-05-21 cs.CL cs.LG

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

为什么自蒸馏(有时)会降低大语言模型的推理能力?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang

Affiliations * Microsoft Research; KAIST; Seoul National University

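AI summary This paper studies why self-distillation can degrade the reasoning capability of LLMs in mathematical reasoning. It finds that self-distillation suppresses the model's expression of uncertainty during reasoning, which lowers performance on unseen problems, and it highlights the importance of expressing appropriate uncertainty for robust reasoning.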

Comments Code is available at https://github.com/beanie00/self-distillation-analysis

详情
AI中文摘要

自蒸馏作为一种有效的预训练后训练范式,通常在缩短推理轨迹的同时提升性能。然而,在数学推理中,我们发现它会减少响应长度并降低性能。我们追溯这种退化到对“信念性言语化”的抑制——模型在推理过程中表达不确定性。通过控制实验,变化条件上下文的丰富性和任务覆盖范围,我们发现使教师模型在丰富信息上进行条件会抑制不确定性表达,从而在有限的任务覆盖范围内实现快速的领域内优化,但损害了领域外性能,其中未见过的问题受益于表达不确定性并进行相应调整。在Qwen3-1.7B/8B、DeepSeek-Distill-Qwen-7B和Olmo3-7B-Instruct上,我们观察到性能下降高达40%。我们的发现强调了适当表达不确定性对于鲁棒推理的重要性,并突显了优化推理行为的重要性,而不仅仅是强化正确答案轨迹。

英文摘要

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

2603.24139 2026-05-21 cs.CV cs.LG

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

tutor-student 强化学习:一种动态课程以实现鲁棒的深度伪造检测

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye

Affiliations * School of Computer Science, Wuhan University; School of Integrated Circuits, Peking University; School of Information, Huazhong Agricultural University; Cyberspace Institute of Advanced Technology, Guangzhou University

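AI summary This paper proposes a Tutor-Student Reinforcement Learning framework that dynamically optimizes the training curriculum to improve the robustness and generalization of deepfake detection.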

Comments Accepted to CVPR 2026

Journal ref The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

详情
AI中文摘要

标准的监督训练将所有样本视为同等重要,这在学习鲁棒且可泛化的特征方面可能是次优的。在本工作中,我们提出了一种新颖的 tutor-student 强化学习 (TSRL) 框架,以动态优化训练课程。我们的方法将训练过程建模为马尔可夫决策过程,其中一个 ``tutor'' agent 学习引导一个 ``student'' (深度伪造检测器)。tutor 实现为一个近端策略优化 (PPO) agent,观察每个训练样本的丰富状态表示,包括不仅其视觉特征,还包括其历史学习动态,如 EMA 损失和遗忘计数。基于此状态,tutor 通过分配连续权重 (0-1) 到样本的损失,从而动态重新加权训练批次。tutor 的奖励基于 student 的即时性能变化,具体奖励从错误预测转为正确预测的过渡。这种策略促使 tutor 学习一个优先考虑高价值样本的课程,如困难但可学习的例子,从而实现更高效和有效的训练过程。我们证明,这种自适应课程相比传统训练方法提高了 student 对未见操纵技术的泛化能力。代码可在 https://github.com/wannac1/TSRL 上获得。

英文摘要

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
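The two quantities the abstract describes, a per-sample weight in [0, 1] applied to the loss and a reward for incorrect-to-correct transitions, can be sketched as follows. This is a hedged illustration only: the function names, the mean normalization of the loss, and the choice to count transitions as the reward are assumptions, and the PPO Tutor itself is not shown. The authors' actual implementation is in the linked repository.

```python
def weighted_batch_loss(losses, weights):
    """Re-weight per-sample losses with Tutor-assigned weights in [0, 1].

    Averaging over the batch is an assumption made for this sketch.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    for w in weights:
        if not 0.0 <= w <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


def tutor_reward(correct_before, correct_after):
    """Count samples that flipped from incorrect to correct after an update.

    Using the raw count as the reward is a simplification for illustration.
    """
    return sum(1 for before, after in zip(correct_before, correct_after)
               if not before and after)
```

Down-weighting a sample toward 0 removes its influence on the update, while weights near 1 keep it, which is how a learned weighting can act as a curriculum.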

2603.17784 2026-05-21 cs.CV cs.LG

ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

基于类重加权和解剖引导时间解码的ResNet-50在消化系统视频分析中的应用

Romil Imtiaz, Dimitris K. Iakovidis

Affiliations * Department of Computer Science and Biomedical Informatics, University of Thessaly

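AI summary This paper presents a multi-label gastrointestinal video analysis pipeline that combines a ResNet-50 frame classifier with anatomy-guided temporal event decoding. Class reweighting improves recognition of rare pathology classes, and the anatomy-guided decoding design raises the final temporal mAP on the challenge test set from 0.3801 to 0.4303.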

Comments ICPR 2026 RARE-VISION Competition

详情
AI中文摘要

我们开发了一种基于ResNet-50帧分类器的多标签消化系统视频分析管道,随后进行解剖引导的时间事件解码。系统从336x336大小的帧中预测17个标签,包括5个解剖类别和12个病理类别。主要挑战是严重的类别不平衡,尤其是罕见病理标签。为了解决这个问题,我们在训练损失中使用了截断的类别级正权重,这在提高罕见类别学习的同时保持了稳定的优化。在时间阶段,我们发现直接帧到事件的转换与官方地面真实值存在碎片化的不匹配。最终提交因此结合了GT风格的帧级事件组成、解剖投票平滑和基于解剖的病理门控,以及保守的滞回解码器。这种设计在挑战测试集上将最终的时间mAP从0.3801提升到0.4303。

英文摘要

We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
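Two ideas in this abstract lend themselves to a short sketch: clipped class-wise positive weights and hysteresis decoding of frame probabilities into events. The code below is an illustrative approximation under stated assumptions, not the submitted system: the negative-to-positive ratio formula, the clip bounds, and the `on`/`off` thresholds are all hypothetical choices.

```python
def clipped_pos_weights(pos_counts, total, max_weight=10.0):
    """Per-class positive weights as neg/pos ratios, clipped to [1, max_weight].

    Clipping keeps very rare classes from dominating the loss.
    """
    weights = []
    for p in pos_counts:
        raw = (total - p) / p if p > 0 else max_weight
        weights.append(min(max(raw, 1.0), max_weight))
    return weights


def hysteresis_events(probs, on=0.6, off=0.4):
    """Turn per-frame probabilities into (start, end) frame-index events.

    An event opens when the probability reaches `on` and closes only when
    it drops below `off`, which suppresses fragmentation from small dips.
    """
    events = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= on:
            start = i
        elif start is not None and p < off:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events
```

With a single threshold at 0.5, the dip to 0.45 in a sequence such as `[0.7, 0.5, 0.45, 0.3]` would already end the event; the two-threshold rule keeps it open until the probability falls below `off`.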

2603.10139 2026-05-21 cs.CL cs.AI cs.CC cs.FL

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

生成-识别不对称性:形式语言理论中根本分裂的六个维度

Romain Peyrichou

Affiliations * Independent Researcher

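AI summary This paper examines the asymmetry between generation and recognition in formal language theory, analyzing how the two diverge along six dimensions: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality.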

Comments Submitted to Information and Computation. 32 pages, 6 figures, 4 tables

详情
AI中文摘要

每种形式文法都定义了一种语言,并且原则上可以以三种方式使用:生成字符串(产生)、识别它们(解析)或者在仅有示例的情况下推断出文法本身(文法归纳)。生成和识别在扩展上是等价的——它们描述相同的集合——但在多个独立的方面操作上是不对称的。归纳是一个更复杂的问题:它没有访问已知文法的途径。尽管这个三元组在编译器设计、自然语言处理和形式语言理论中至关重要,但尚未有综述将其视为统一的多维现象。我们识别出六个维度,这些维度使生成和识别产生差异:计算复杂性、歧义性、方向性、信息可用性、文法归纳和时间性。我们证明了常见的“生成容易,解析困难”的描述是误导的:无约束生成是简单的,但受约束生成可以是NP难的。真正的不对称性在于解析总是受约束的(输入已知)而生成不需要。这两个维度——方向性和时间性——之前尚未被识别为生成-识别不对称性的维度。我们将时间维度与Hale(2001)和Levy(2008)的惊奇框架联系起来,认为惊奇正式化了生成者(惊奇=0)和预测在不确定性下的解析者(惊奇>0)之间的时间不对称性。我们回顾了自然语言处理中的双向系统,并观察到双向性已有五十年的历史,但尚未转移到大多数领域特定的应用中。最后,我们讨论了大型语言模型,它们在架构上统一了生成和识别,但在操作上保持了不对称性。

英文摘要

Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
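For readers unfamiliar with the surprisal framework mentioned above, the standard definition is the negative log-probability of the next symbol given its prefix. The second and third lines are one reading of the abstract's claim, written out as an assumption rather than taken from the paper: a generator that has already chosen its next symbol assigns it probability 1, whereas a parser predicting under uncertainty assigns it less than 1.

```latex
S(w_t) = -\log P\left(w_t \mid w_1, \dots, w_{t-1}\right)

\text{generator: } P(w_t \mid w_{<t}) = 1 \;\Rightarrow\; S(w_t) = 0

\text{parser: } P(w_t \mid w_{<t}) < 1 \;\Rightarrow\; S(w_t) > 0
```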

2602.16813 2026-05-21 cs.CL cs.AI

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

流映射语言模型:通过连续去噪实现一步语言建模

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

Affiliations * Carnegie Mellon University

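AI summary This paper proposes flow map language models built on continuous flows over one-hot token embeddings. They outperform discrete diffusion models in both quality and speed, showing the potential of continuous flows for generative modeling over discrete modalities.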

Comments 58 pages, 40 figures

详情
AI中文摘要

基于离散扩散的语言模型因其在生成速度上优于自回归模型而受到广泛关注。尽管具有潜力,这些模型通常在少量步骤范围内生成质量急剧下降,从而在实践中限制了速度的大幅提升。本文表明,基于连续流的一步语言模型在质量和速度上均优于离散扩散模型。重要的是,我们的连续公式定义了一个独特的流映射,可以直接学习以实现高效的少量步骤推断,这种结构在离散方法中不可用。在此设置中,我们展示了流及其关联的流映射可以通过简单的交叉熵目标学习,这些目标尊重数据的简单几何结构,并且我们识别了三种不同的流映射蒸馏选择,其性能在实践中进行了比较。利用这些见解,我们构建了一个流语言模型(FLM),该模型在One Billion Words(LM1B)和OpenWebText(OWT)数据集上与最先进的离散扩散基线相媲美。然后,我们将FLM蒸馏为流映射语言模型(FMLM),其一步生成超过了最近少量步骤离散扩散语言模型的8步质量。本文的工作挑战了广泛认为离散噪声过程是离散模态生成建模所必需的假设,并为大规模加速语言建模铺平了道路。代码可在https://github.com/david3684/flm获取。

英文摘要

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.

2602.13485 2026-05-21 cs.LG stat.ML

Federated Learning of Nonlinear Temporal Dynamics with Graph Attention-based Cross-Client Interpretability

基于图注意力的跨客户端可解释性非线性时序动态联邦学习

Ayse Tursucular, Ayush Mohanty, Nazal Mohamed, Nagi Gebraeel

Affiliations * Georgia Institute of Technology

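AI summary This paper proposes a federated learning framework for learning cross-client temporal interdependencies in distributed nonlinear systems. Each client maps high-dimensional local observations to low-dimensional latent states with a nonlinear state space model, and the server learns a graph-structured neural state transition model over the communicated latent states using a Graph Attention Network. Relating the Jacobian of the learned server-side transition model to the attention coefficients makes the cross-client temporal interdependencies interpretable.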

Comments Manuscript under review

详情
AI中文摘要

现代工业系统的网络越来越多地由分布式传感器监控,其中每个系统由多个子系统组成,生成高维时间序列数据。这些子系统通常是相互依赖的,因此理解一个子系统中的时序模式如何与其他子系统相关联变得很重要。在去中心化设置中,原始测量值无法共享,客户端观测是异质的,这使得问题更加复杂。在实际部署中,每个子系统(客户端)运行一个固定的专有模型,无法修改或重新训练,限制了现有方法。非线性动态进一步使跨客户端时序依赖关系难以解释,因为它们嵌入在非线性状态转移函数中。本文提出了一种联邦框架,用于在这些约束下学习跨客户端的时序依赖关系。每个客户端使用非线性状态空间模型将高维本地观测映射到低维潜在状态。中央服务器利用图注意力网络在通信的潜在状态上学习图结构的神经状态转移模型。为了可解释性,我们将学习的服务器侧转移模型的雅可比矩阵与注意力系数相关联,从而首次提供了对去中心化非线性系统中跨客户端时序依赖关系的可解释性描述。我们建立了理论收敛保证,以达到集中化 oracle,并通过合成实验验证了该框架,展示了收敛性、可解释性、可扩展性和隐私。此外,现实世界实验显示其性能与去中心化基线相当。

英文摘要

Networks of modern industrial systems are increasingly monitored by distributed sensors, where each system comprises multiple subsystems generating high dimensional time series data. These subsystems are often interdependent, making it important to understand how temporal patterns at one subsystem relate to others. This is challenging in decentralized settings where raw measurements cannot be shared and client observations are heterogeneous. In practical deployments each subsystem (client) operates a fixed proprietary model that cannot be modified or retrained, limiting existing approaches. Nonlinear dynamics further make cross client temporal interdependencies difficult to interpret because they are embedded in nonlinear state transition functions. We present a federated framework for learning temporal interdependencies across clients under these constraints. Each client maps high dimensional local observations to low dimensional latent states using a nonlinear state space model. A central server learns a graph structured neural state transition model over the communicated latent states using a Graph Attention Network. For interpretability we relate the Jacobian of the learned server side transition model to attention coefficients, providing the first interpretable characterization of cross client temporal interdependencies in decentralized nonlinear systems. We establish theoretical convergence guarantees to a centralized oracle and validate the framework through synthetic experiments demonstrating convergence, interpretability, scalability and privacy. Additional real world experiments show performance comparable to decentralized baselines.

2602.06358 2026-05-21 cs.CL cs.AI

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

SHINE:一种可扩展的上下文超网络,用于在单次传递中将上下文映射到LoRA

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

Affiliations * Institute for Artificial Intelligence, Peking University; Technion, NVIDIA; University of Oxford; School of Electronics Engineering and Computer Science, Peking University

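AI summary This paper proposes SHINE, a scalable in-context hypernetwork that maps diverse meaningful contexts to high-quality LoRA adapters in a single pass. By reusing the frozen LLM's own parameters and introducing architectural innovations, it overcomes key limitations of prior hypernetworks and achieves strong expressive power with relatively few parameters.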

详情
AI中文摘要

我们提出SHINE(可扩展的上下文超网络),一种可扩展的超网络,能够将多样且有意义的上下文映射到高质量的LoRA适配器,用于大型语言模型(LLMs)。通过在上下文超网络设计中重用冻结LLM的自身参数,并引入架构创新,SHINE克服了先前超网络的关键限制,以相对较少的参数实现了强大的表达能力。我们引入了预训练和指令微调流水线,并训练我们的超网络在单次前向传递中从多样且有意义的上下文中生成高质量的LoRA适配器。它在不进行微调的情况下更新LLM参数,并立即启用与上下文相关的复杂问答任务,而无需直接访问上下文,有效地将上下文知识转换为参数知识。我们的工作在各种任务上取得了出色的结果,相比基于SFT的LLM适应方法,大大节省了时间和计算和内存成本,并展示了良好的可扩展性潜力。我们的代码可在https://github.com/MuLabPKU/SHINE获取。

英文摘要

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high-quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, substantially reduces time, computation, and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE.
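As background for the abstract above, a LoRA adapter replaces a full weight update with a low-rank product. The first line is the standard LoRA parameterization (the usual scaling factor is omitted for brevity). The second line is only schematic notation for "a hypernetwork maps a context to adapters in one pass": the symbols H_phi, c, and the layer index l are assumptions introduced here, not notation from the paper.

```latex
W' = W_0 + \Delta W = W_0 + B A, \qquad
B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

\{(A_\ell, B_\ell)\}_{\ell=1}^{L} = H_\phi(c)
```

Because only A and B are produced, the adapted model can answer context-related questions without the context in its prompt, which is the sense in which in-context knowledge becomes in-parameter knowledge.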

2601.15133 2026-05-21 cs.CV cs.LG

Building Deep Graph Predictors with Graph Imitation Learning

通过图模仿学习构建深度图预测器

André Eberhard, Gerhard Neumann, Pascal Friederich

Affiliations * Karlsruhe Institute of Technology (KIT)

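AI summary This paper proposes GRAIL, a framework that uses graph imitation learning to address representation issues in supervised graph prediction, and reports competitive results across 18 benchmarks.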

详情
AI中文摘要

近年来,神经生成文本、图像和音频方面取得了显著进展,得益于成熟的训练流程和大规模优化。然而,对于图而言,这种进展更为有限。我们归因于图特定的优化和表示挑战,这些挑战削弱了通过反向传播和梯度下降训练神经网络的有效性。我们主张在最近提出的监督图预测模型中,将图表示为固定大小的欧几里得网格可能不是最优选择。为了支持我们的观点,我们分析了神经图生成方法,并识别出导致训练神经网络生成图时出现陷阱的理论挑战。受此分析启发,我们引入GRAIL(Graph Imitation Learning),一种用于监督设置的框架,其中监督信号是一个图。GRAIL通过马尔可夫决策过程在部分图的嵌入上依次生成图,从而避免了固定大小网格图表示相关的表示问题。我们实验证明,GRAIL在18个全面的基准测试中实现了具有竞争力的结果,在多个设置中匹配或超过了最先进的方法。

英文摘要

Recent years have seen substantial progress in neural generation of text, images, and audio, supported by mature training pipelines and large-scale optimization. For graphs, however, comparable progress has been more limited. We attribute this gap to graph-specific optimization and representation challenges that undermine the effectiveness of training neural networks with backpropagation and gradient descent. We argue that representing graphs on a fixed-size Euclidean grid, as is common in recently proposed models for supervised graph prediction, may not be the optimal choice in these settings. To support our view, we provide an analysis of neural graph generation methods and identify theoretical challenges that lead to pitfalls when training neural networks to produce graphs as their output. Motivated by this analysis, we introduce GRAph Imitation Learning (GRAIL), a framework for training neural networks in supervised settings in which the supervision signal is a graph. GRAIL generates graphs sequentially through a Markov decision process over embeddings of partial graphs, thereby avoiding the representation issues associated with fixed-size grid graph representations. We empirically show that GRAIL achieves competitive results on supervised graph prediction across a comprehensive suite of 18 benchmarks, matching or surpassing state-of-the-art methods in several settings.

2601.10191 2026-05-21 cs.AI

How does downsampling affect needle electromyography signals? A generalisable workflow for understanding downsampling effects on high-frequency time series

降采样如何影响针式肌电信号?一种可推广的用于理解降采样对高频时间序列影响的工作流程

Mathieu Cherpitel, Janne Luijten, Thomas Bäck, Camiel Verhamme, Martijn Tannemaat, Anna Kononova

Affiliations * Leiden Institute of Advanced Computer Science; Leiden University Medical Centre; Department of Neurology; Amsterdam University Medical Centre

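AI summary This paper studies how downsampling affects high-frequency time series signals and proposes a generalisable workflow that combines shape-based distortion metrics with classification outcomes from feature-based machine learning models to quantify how different downsampling algorithms and factors affect waveform integrity and predictive performance.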

详情
AI中文摘要

自动分析针式肌电(nEMG)信号正逐渐成为一种支持神经肌肉疾病(NMDs)检测的工具,但这些信号的高采样率和异质性给基于特征的机器学习模型带来了显著的计算挑战,特别是接近实时分析。降采样提供了一个潜在的解决方案,但其对诊断信号内容和分类性能的影响仍不够清楚。本研究提出了一种系统评估降采样导致信息损失的工作流程。该工作流程结合了基于形状的失真度量与现有基于特征的机器学习模型的分类结果和特征空间分析,以量化不同降采样算法和因素对波形完整性和预测性能的影响。我们使用三类NMD分类任务来实验性评估该工作流程。我们展示了该工作流程如何识别保留诊断信息同时显著减少计算负载的降采样配置。形状基失真度量分析显示,基于形状的降采样算法优于标准降采样,因为它们更好地保留了峰值结构和整体信号形态。结果为选择能够支持接近实时nEMG分析的降采样配置提供了实用指导,并突显了一种可推广的工作流程,可用于在其他高频时间序列应用中平衡数据减少与模型性能。

英文摘要

Automated analysis of needle electromyography (nEMG) signals is emerging as a tool to support the detection of neuromuscular diseases (NMDs), yet the signals' high and heterogeneous sampling rates pose substantial computational challenges for feature-based machine-learning models, particularly for near real-time analysis. Downsampling offers a potential solution, but its impact on diagnostic signal content and classification performance remains insufficiently understood. This study presents a workflow for systematically evaluating information loss caused by downsampling in high-frequency time series. The workflow combines shape-based distortion metrics with classification outcomes from available feature-based machine learning models and feature space analysis to quantify how different downsampling algorithms and factors affect both waveform integrity and predictive performance. We use a three-class NMD classification task to experimentally evaluate the workflow. We demonstrate how the workflow identifies downsampling configurations that preserve diagnostic information while substantially reducing computational load. Analysis of shape-based distortion metrics showed that shape-aware downsampling algorithms outperform standard decimation, as they better preserve peak structure and overall signal morphology. The results provide practical guidance for selecting downsampling configurations that enable near real-time nEMG analysis and highlight a generalisable workflow that can be used to balance data reduction with model performance in other high-frequency time-series applications as well.
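The contrast between standard decimation and shape-aware downsampling can be seen on a toy signal with a single sharp peak. The sketch below is an illustration under stated assumptions: `decimate` is plain subsampling with no anti-alias filter, and `minmax_downsample` is a simple min-max bucketing scheme chosen here as a stand-in for a peak-preserving method. The abstract does not name the specific algorithms the study evaluates.

```python
def decimate(signal, factor):
    """Keep every `factor`-th sample (no anti-alias filtering in this sketch)."""
    return signal[::factor]


def minmax_downsample(signal, factor):
    """Keep the minimum and maximum of each window, in their original order.

    Emits up to two samples per window, so the reduction is at most
    factor / 2, but peaks and troughs inside each window survive.
    """
    out = []
    for start in range(0, len(signal), factor):
        window = signal[start:start + factor]
        lo_i = min(range(len(window)), key=window.__getitem__)
        hi_i = max(range(len(window)), key=window.__getitem__)
        for i in sorted({lo_i, hi_i}):
            out.append(window[i])
    return out
```

On `[0, 0, 9, 0, 0, 0, 0, 0]` with a factor of 4, plain subsampling skips the spike entirely while the min-max variant retains it, which is the kind of waveform distortion a shape-based metric is meant to expose.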

2511.04520 2026-05-21 cs.CV

THEval. Evaluation Framework for Talking Head Video Generation

THEval:谈话头视频生成的评估框架

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

Affiliations * Inria Centre at Université Côte d’Azur; Shanghai AI Laboratory; da/sec - Biometrics and Security Research Group, Hochschule Darmstadt

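AI summary This paper proposes a new evaluation framework that assesses the quality, naturalness, and synchronization of generated talking head videos using 8 metrics. Extensive experiments show that existing methods still struggle with expressiveness and artifact-free details.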

Comments CVPR 2026 Findings, Project Page: https://newbyl.github.io/theval_project_page/

详情
AI中文摘要

视频生成已经取得了显著进展,生成的视频越来越接近真实视频。然而,生成技术的快速发展已经超过了评估指标的完善速度。目前,对谈话头生成的评估主要依赖于有限的指标,评估一般视频质量、唇同步以及进行用户研究。受此启发,我们提出了一种新的评估框架,包含8个指标,涵盖三个维度(i)质量,(ii)自然性,(iii)同步性。在选择指标时,我们强调效率以及与人类偏好的一致性。基于这些考虑,我们分析了头部、嘴巴和眉毛的细粒度动态以及面部质量。我们在85,000个由17种最先进的模型生成的视频上进行了广泛的实验,发现尽管许多算法在唇同步方面表现优异,但在生成表情和无瑕疵细节方面面临挑战。这些视频是基于一个新实测数据集生成的,我们为此数据集进行了精心挑选,以减少训练数据的偏见。我们提出的基准框架旨在评估生成方法的改进。原始代码、数据集和排行榜将被公开发布并定期更新,以反映该领域的发展进展。

英文摘要

Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation relies primarily on a limited set of metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus on analyzing fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated from a novel real dataset that we curated in order to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset, and leaderboards will be publicly released and regularly updated with new methods in order to reflect progress in the field.

2510.14737 2026-05-21 cs.CV

Free-Grained Hierarchical Visual Recognition

自由粒度层次视觉识别

Seulki Park, Zilin Wang, Stella X. Yu

Affiliations * University of Michigan; UC Berkeley

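AI summary This paper studies hierarchical visual recognition when real-world labels are incomplete and of mixed granularity. It introduces free-grained training, uses text-based supervision and semi-supervised learning to improve hierarchical methods under incomplete supervision, and proposes free-grained inference so that the model can choose its prediction depth.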

Comments Accepted to CVPR 2026. 31 pages

详情
AI中文摘要

层次图像识别旨在沿着语义分类学预测类别标签,从广义类别到具体类别。通常假设每张训练图像在其分类路径上完全标注。现实更复杂:远处的鸟可能仅被标记为鸟,而清晰的特写可能证明是 bald eagle。我们引入了自由粒度训练,其中标签可能出现在分类学的任何层次,模型必须从不完整、混合粒度的监督中学习一致的层次预测。我们构建了具有不同标签粒度的基准数据集,并展示了现有层次方法在该设置下性能急剧下降。为弥补缺失的监督,我们提出了两种简单解决方案:一种是添加基于文本的广泛监督以捕捉视觉属性,另一种是将特定分类学层次中缺失的标签视为半监督学习问题。我们还研究了自由粒度推理,其中模型选择预测深度,当细粒度预测不确定时返回可靠的粗粒度标签。整体而言,我们的任务、数据集和方法使层次识别更接近现实世界中标签的产生方式。

英文摘要

Hierarchical image recognition seeks to predict class labels along a semantic taxonomy, from broad categories to specific ones, typically under the tidy assumption that every training image is fully annotated along its taxonomy path. Reality is messier: a distant bird may be labeled only "bird", while a clear close-up may justify "bald eagle". We introduce free-grained training, where labels may appear at any level of the taxonomy and models must learn consistent hierarchical predictions from incomplete, mixed-granularity supervision. We build benchmark datasets with varying label granularity and show that existing hierarchical methods deteriorate sharply in this setting. To make up for missing supervision, we propose two simple solutions: one adds broad text-based supervision that captures visual attributes, and the other treats missing labels at specific taxonomy levels as a semi-supervised learning problem. We also study free-grained inference, where the model chooses how deep to predict, returning a reliable coarse label when a fine-grained one is uncertain. Together, our task, datasets, and methods move hierarchical recognition closer to the way labels arise in the real world.
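The free-grained inference idea, predicting only as deep as the model is confident, can be sketched with a simple back-off rule along one taxonomy path. This is a hypothetical illustration rather than the authors' method: the function name, the fixed `threshold`, and the example labels and confidences are assumptions.

```python
def predict_with_backoff(path_probs, threshold=0.7):
    """Return the deepest label whose confidence clears the threshold.

    path_probs: list of (label, confidence) pairs ordered coarse to fine.
    Stops at the first level that falls below `threshold` and returns the
    last accepted label, or None if even the coarsest level is uncertain.
    """
    chosen = None
    for label, confidence in path_probs:
        if confidence < threshold:
            break
        chosen = label
    return chosen
```

For a blurry image the rule might stop at a coarse label such as "bird", while a clear close-up would pass every level and return the fine-grained label.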

2509.07120 2026-05-21 cs.CV

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

块稀疏全局注意力:用于高效多视图几何变换器

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe

Affiliations * Computer Vision Group, RWTH Aachen University

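AI summary This paper proposes a block-sparse replacement for dense global attention, implemented with optimized kernels, that makes multi-view reconstruction more efficient and substantially improves scalability to large image sets.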

Comments Project page at https://vision.rwth-aachen.de/sparse-vggt

详情
AI中文摘要

高效且准确的前馈多视图重建长期以来一直是计算机视觉中的重要任务。最近的基于变换器的模型,如VGGT、π³和MapAnything,通过相对简单的架构展示了显著的性能。然而,它们的可扩展性从根本上受到全局注意力二次复杂度的限制,这在处理大规模图像集时会带来显著的运行时间瓶颈。在本工作中,我们通过实证分析这些模型的全局注意力矩阵,并观察到概率质量集中在一小部分补丁-补丁交互上,这些交互对应于跨视图几何对应关系。基于这一见解并受近期大语言模型进展的启发,我们提出了一种无需训练的块稀疏替代密集全局注意力方法,通过高度优化的内核实现。我们的方法在保持可比任务性能的同时,将推理速度提高了超过3倍。在全面的多视图基准测试中,我们的方法无缝集成到现有的基于全局注意力的架构中,如VGGT、π³和MapAnything,同时显著提高了处理大规模图像集的可扩展性。

英文摘要

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, π³, and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than 3× while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, π³, and MapAnything, while substantially improving scalability to large image collections.

2507.21168 2026-05-21 cs.CL cs.AI cs.LG

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

多样化的大语言模型还是多样化的问题解释?那是集成的问题

Rafael Rosales, Santiago Miret

Affiliations * Intel Labs

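AI summary This paper compares two diversity approaches for answering binary questions with LLMs, model diversity and question interpretation diversity, and finds that question interpretation diversity yields better ensemble accuracy.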

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 5116-5128

详情
AI中文摘要

有效利用多样性已被证明可以提高各种机器学习模型,包括大语言模型(LLMs)的性能。然而,确定最有效的多样性使用方法仍是一个挑战。在本工作中,我们比较了两种用于使用LLMs回答二元问题的多样性方法:模型多样性,即多个模型回答相同的问题,以及问题解释多样性,即让同一模型回答以不同方式表述的同一问题。对于这两种情况,我们应用多数投票作为集成共识启发式方法来确定最终答案。我们在BoolQ、StrategyQA和PubMedQA上的实验表明,问题解释多样性在集成准确性上始终优于模型多样性。此外,我们对GPT和LLaMA的分析表明,模型多样性通常产生介于最佳和最差集成成员之间的结果,而没有明显的改进。

英文摘要

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. In both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on BoolQ, StrategyQA, and PubMedQA show that question interpretation diversity consistently leads to better ensemble accuracy than model diversity. Furthermore, our analysis of GPT and LLaMA shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
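The two ensembling setups described above differ only in where the votes come from. A minimal sketch, assuming each model is a plain callable that maps a question string to a boolean answer; the function names and the stub models are hypothetical and stand in for real LLM calls, and this is not the authors' implementation.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def answer_with_model_diversity(models, question):
    """Several models answer the same question, then vote."""
    return majority_vote([model(question) for model in models])


def answer_with_interpretation_diversity(model, framings):
    """One model answers several framings of the same question, then votes."""
    return majority_vote([model(framing) for framing in framings])


# Stub "models" used only to make the sketch runnable.
always_yes = lambda q: True
always_no = lambda q: False
naive = lambda q: "not" not in q  # says yes unless the framing contains "not"

print(answer_with_model_diversity([always_yes, always_no, always_yes], "Is water wet?"))
print(answer_with_interpretation_diversity(
    naive,
    ["Is water wet?", "Is it true that water is wet?", "Is water not dry?"],
))
```

Using an odd number of voters avoids ties for binary questions; with an even number, the tie-breaking rule in `majority_vote` becomes part of the ensemble's behavior.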

2506.04042 2026-05-21 cs.CL

Causal Path Alignment: Anchoring the Optimization Trajectory for Controllable In-Parameter Knowledge Editing

因果路径对齐:为可控的参数知识编辑锚定优化轨迹

Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Weiping Wang

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

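AI summary This paper proposes the Causal Path Alignment framework, which anchors the optimization trajectory to address Subject-Dominant Memory Interference in in-parameter knowledge editing, improving relation specificity and reducing side effects.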

Comments Accepted by IJCAI 2026

详情
AI中文摘要

知识编辑对于高效更新大型语言模型(LLM)的参数记忆至关重要,使它们能够在动态环境中作为演进代理发挥作用。然而,主流的参数知识编辑方法存在主体主导记忆干扰问题:修改特定事实会无意中破坏与同一主题相关的更广泛结构知识。我们诊断根本原因是捷径学习病理,即优化目标过拟合了主体表示,而绕过了本质的关系上下文。为此,我们提出因果路径对齐(CPA),一种原理性的框架,旨在将优化轨迹锚定到有效的因果路径上。CPA强制参数更新通过关系意识的中间状态,从而防止上下文依赖性的丢失。在多种LLM基础架构上的实验结果表明,CPA一致消除了捷径,显著提高了关系特异性,同时表现出最小的副作用。此外,CPA为现有编辑器提供了一种模型无关的插件,为可靠和可信的参数知识编辑铺平了道路。

英文摘要

Knowledge editing is pivotal for efficiently updating the parametric memory of Large Language Models (LLMs), enabling them to function as evolving agents in dynamic environments. However, mainstream in-parameter knowledge editing approaches suffer from Subject-Dominant Memory Interference: modifying a specific fact inadvertently corrupts the broader structural knowledge associated with the same subject within LLMs. We diagnose the root cause as a shortcut learning pathology, where the optimization objective overfits subject representations while bypassing the essential relational context. To rectify this, we propose Causal Path Alignment (CPA), a principled framework designed to anchor the optimization trajectory to valid causal pathways. CPA forces parameter updates to route through relation-aware intermediate states, thereby preventing the erasure of contextual dependencies. Experimental results across diverse LLM backbones demonstrate that CPA consistently eliminates the shortcut, significantly improving relation specificity while exhibiting minimal side effects. Moreover, CPA serves as a model-agnostic plug-in for existing editors, paving the way for reliable and trustworthy in-parameter knowledge editing.

2501.01793 2026-05-21 cs.LG cs.AI

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

创建从未存在过的虚拟学生:利用大型语言模型和CTGANs进行合成数据生成

Mohammad Khalil, Sam Urmian, Ronas Shakya, Qinyi Liu

Affiliations * Centre for the Science of Learning & Technology (SLATE), University of Bergen

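AI summary This paper studies the potential of Generative Adversarial Networks (GANs) and Large Language Models (LLMs) for generating synthetic tabular data, explores whether synthetic data can be used to create artificial students for learning analytics models, and evaluates the performance of different generative models.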

详情
AI中文摘要

在本研究中,我们探索了人工智能和深度学习技术,特别是生成对抗网络(GANs)和大型语言模型(LLMs)在生成合成表格数据方面的成长潜力。获取高质量学生数据对于推进学习分析至关重要,但隐私问题和全球更严格的数据保护法规限制了其可用性和使用。合成数据提供了一个有前途的替代方案。我们探讨了是否可以利用合成数据来创建虚拟学生以服务于学习分析模型。使用流行的GAN模型CTGAN和三种LLMs-GPT2、DistilGPT2和DialoGPT,我们生成了合成的表格学生数据。我们的结果表明,这些方法具有强大的潜力,能够生成高质量的合成数据集,与真实学生数据相似。为了验证我们的发现,我们应用了一套全面的效用评估指标来评估合成数据的统计和预测性能,并比较了不同生成模型,特别是LLMs的性能。本研究旨在为学习分析社区提供有价值的见解,为扩展学习分析领域的方法学工具箱提供新的创新方法。

英文摘要

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality student data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit its availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs (GPT2, DistilGPT2, and DialoGPT), we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real student data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, especially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field's methodological toolbox with new innovative approaches for learning analytics data generation.