arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11330 2026-05-13 cs.AI

Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

Wenbo Chen, Veena Padmanabhan, Tootiya Giyahchi, Elaine Wong, Leman Akoglu

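Affiliations: Amazon; Carnegie Mellon University

AI summary: This paper rethinks how LLM hallucination detection is evaluated. It proposes a list of desirable properties for building effective hallucination detection benchmarks (HDBs) and finds that existing benchmarks fall short in two areas: long-context RAG (retrieval-augmented generation) benchmarks and support for realistic label noise. To address this, the authors build and open-source TRIVIA+, a new RAG-based hallucination detection benchmark that contains the longest-context samples to date and ships several sets of noisy labels to mimic real-world conditions. Experiments show that existing detectors still have substantial room for improvement on RAG tasks and that label noise significantly affects detection performance.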

Comments ACL 2026 main conference

Abstract

Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish a desiderata of properties for hallucination detection benchmarks (HDBs) to exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors although real-world use-cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties including (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different, both sample-dependent and sample-independent, noise schemes. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.
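
The difference between sample-independent and sample-dependent label noise can be illustrated with a minimal sketch. This is not the benchmark's actual noise design: the function names, the flip rates, and the choice of context length as the sample-dependent feature are all assumptions made for illustration.

```python
import random

def flip_sample_independent(labels, rate, seed=0):
    """Flip each binary label with the same probability `rate`."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]

def flip_sample_dependent(labels, context_lengths, max_rate, seed=0):
    """Flip each label with a probability that grows with context length.

    Assumption (illustrative only): longer contexts are harder to annotate,
    so the flip probability is max_rate * (length / longest length).
    """
    rng = random.Random(seed)
    longest = max(context_lengths)
    noisy = []
    for y, n in zip(labels, context_lengths):
        p = max_rate * n / longest
        noisy.append(1 - y if rng.random() < p else y)
    return noisy

# Hypothetical toy data: 1 = hallucinated, 0 = faithful.
labels = [0, 1, 1, 0, 1, 0]
lengths = [200, 4000, 800, 12000, 50, 6000]
print(flip_sample_independent(labels, rate=0.2))
print(flip_sample_dependent(labels, lengths, max_rate=0.4))
```

A detector can then be scored against both the clean and the noisy labels to see how much of its measured performance survives each scheme.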

2605.11328 2026-05-13 cs.LG cs.AI

Epistemic Uncertainty for Test-Time Discovery

Kainat Riaz, Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Ayesha Mohsin, Aqib Riaz, Ali Subhan, John M. Cioffi

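Affiliations: Stanford University; National University of Sciences and Technology; University of Oklahoma

AI summary: This study looks at using large language models for scientific discovery at test time. It notes that standard reinforcement learning penalizes high-variance mutations and therefore favors familiar patterns, which makes it hard for the reward to keep improving. The authors propose an exploration strategy based on an epistemic uncertainty measure: a small ensemble of adapters maintained over a frozen base model identifies regions where difficulty stems from insufficient training coverage rather than from the intrinsic hardness of the problem, steering the policy toward regions where discovery is possible. Experiments show that the method raises the maximum reward on several scientific discovery tasks while maintaining higher solution diversity.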

Abstract

Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks and maintains substantially higher solution diversity; an ablation study confirms that the regularizer is essential for sustaining this behavior.
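
The mutual-information disagreement described above has a standard closed form for a finite ensemble: the entropy of the averaged prediction minus the average entropy of the members. The sketch below computes it for one token position; the function names and the toy distributions are illustrative assumptions, not the paper's code.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def epistemic_uncertainty(member_probs):
    """Mutual information between the prediction and the ensemble member.

    member_probs: one next-token distribution per adapter.
    Returns H(mean of members) - mean of H(member). The value is zero when
    all members agree and grows as they disagree.
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    mean = [sum(p[v] for p in member_probs) / k for v in range(vocab)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in member_probs) / k
    return total - aleatoric

agree = [[0.5, 0.5], [0.5, 0.5]]      # uncertain, but the members agree
disagree = [[1.0, 0.0], [0.0, 1.0]]   # confident members that conflict
print(epistemic_uncertainty(agree))     # zero: no epistemic signal
print(epistemic_uncertainty(disagree))  # log(2): maximal for two outcomes
```

The first case shows why a single network's confidence is not enough: the prediction is maximally uncertain, yet the members agree, so the disagreement term vanishes.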

2605.11327 2026-05-13 cs.LG

Neural Statistical Functions

Daniel Xu, Yuxin Xie, Minghao Guo, Haixu Wu, Wojciech Matusik

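Affiliations: Columbia University; MIT CSAIL

AI summary: This paper proposes neural statistical functions, a new family of models that directly estimate statistics over continuous ranges of operating conditions, avoiding the high latency of repeated inference in conventional approaches. Built from pre-trained single-sample predictors and scattered data, the method introduces prefix statistics to unify statistical functions such as integrals, quantiles, and extrema in a single interval-conditional framework, and uses a principled identity between prefix statistics and individual-case regression as the learning objective. Experiments show strong performance on statistics of complex physical processes, including accumulated energy in dynamical systems, quantiles of aerodynamic responses, and maximum stress in crash processes, with up to 100 times fewer model evaluations.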

Abstract

Classical deep learning typically operates on individual cases. Despite its success, real-world usage often requires repeated inference to estimate statistical quantities for complex decision-making tasks involving uncertainty or extreme-value analysis, resulting in substantial latency. We introduce neural statistical functions, a new family of models learned from pre-trained single-sample predictors and scattered data samples, which can directly infer statistics over continuous operating condition ranges without explicit sampling. By introducing the notion of prefix statistics, we transform and unify diverse statistical functions (e.g., integrals, quantiles, and maxima) into an interval-conditional framework, in which a principled identity between the prefix statistics and the individual-case regression serves as the learning objective. Neural statistical functions achieve strong performance in estimating essential statistics of complex physical processes, including accumulated energy in dynamical systems, quantiles of aerodynamic responses, and maximum stress in crash processes, while achieving up to a 100$\times$ reduction in model evaluations.
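
For the integral case, one plausible reading of the prefix-statistics idea is the fundamental theorem of calculus. The notation below ($f$, $F$, $c_0$) is assumed for illustration and is not taken from the paper, whose exact identity may differ.

```latex
% f(c): single-sample prediction at operating condition c (assumed notation)
% F(c): prefix statistic, the integral of f from a fixed origin c_0 up to c
F(c) = \int_{c_0}^{c} f(u)\,du
\quad\Longrightarrow\quad
\frac{dF}{dc}(c) = f(c),
\qquad
\int_{a}^{b} f(u)\,du = F(b) - F(a).
```

Under this reading, a model of $F$ could be supervised by matching its derivative to the single-sample predictor, and any interval statistic would follow from two evaluations of $F$ instead of many evaluations of $f$.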

2605.11324 2026-05-13 cs.LG stat.ML

$\varepsilon$-Good Action Identification in Fixed-Budget Monte Carlo Tree Search

Yinan Li, Tuan Nguyen, Kwang-Sung Jun

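Affiliations: Department of Computer Science; University of Arizona; CSE/GSAI POSTECH

AI summary: This paper studies identifying ε-good actions under a fixed budget in depth-2 max-min trees, an important special case of Monte Carlo Tree Search. The authors propose an algorithm that does not take ε as input yet achieves instance-dependent error bounds for every meaningful ε, with a misidentification probability that decays exponentially. They also analyze how the hardness structure of this problem differs from that of standard K-armed bandits and give corresponding lower-bound results. According to the authors, this is the first fixed-budget algorithmic guarantee for max-min action identification.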

Abstract

We study the fixed-budget max-min action identification problem in depth-2 max-min trees, an important special case of Monte Carlo Tree Search. A learner sequentially allocates $T$ samples to leaves and then recommends a subtree whose minimum leaf value is largest. Motivated by approximate planning, we focus on $\varepsilon$-good subtree identification, where any subtree whose min value is within $\varepsilon$ of the optimal maximin value is acceptable. Our main contribution is an $\varepsilon$-agnostic algorithm: it does not require $\varepsilon$ as input, but achieves instance-dependent error bounds for every meaningful $\varepsilon$. We show that the misidentification probability decays as $\exp(-\widetilde{\Theta}(T/H_2(\varepsilon)))$, where $H_2(\varepsilon)$ captures both cross-subtree and within-subtree gaps. When each subtree has a single leaf, the problem reduces to standard fixed-budget best-arm identification, and our analysis recovers, up to accelerating factors, known $\varepsilon$-good guarantees for halving-style methods while giving a new $\varepsilon$-good guarantee for Successive Rejects. On the lower-bound side, we provide complementary positive and negative results showing that max-min identification has a different hardness structure from standard $K$-armed bandits. To our knowledge, this is the first provable fixed-budget algorithmic guarantee for max-min action identification.
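
The identification target can be made concrete with a small sketch. It computes the maximin value and the set of $\varepsilon$-good subtrees from the true leaf means, which the learner never observes directly; the sampling algorithm itself is not reproduced here, and the function names and leaf values are illustrative assumptions.

```python
def maximin_value(tree):
    """Largest minimum leaf value over all subtrees of a depth-2 tree."""
    return max(min(leaves) for leaves in tree)

def eps_good_subtrees(tree, eps):
    """Indices of subtrees whose min leaf is within eps of the maximin value."""
    best = maximin_value(tree)
    return [i for i, leaves in enumerate(tree) if min(leaves) >= best - eps]

# Each inner list holds the true mean leaf values of one subtree
# (values chosen to be exactly representable in binary floating point).
tree = [[0.875, 0.5], [0.625, 0.75], [0.5625, 1.0]]
print(maximin_value(tree))              # 0.625
print(eps_good_subtrees(tree, 0.0))     # [1]
print(eps_good_subtrees(tree, 0.0625))  # [1, 2]
print(eps_good_subtrees(tree, 0.125))   # [0, 1, 2]
```

Note that subtree 0 contains the second-largest leaf but has the smallest minimum, which is why within-subtree gaps matter in addition to cross-subtree gaps.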

2605.11317 2026-05-13 cs.CL cs.AI

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong

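Affiliations: Florida State University; AT&T Chief Data Office; Louisiana State University; Vanderbilt University

AI summary: Deploying large language models (LLMs) in multi-turn dialogue is costly in latency, memory, and API spend. This paper proposes SOMA, a framework that uses the early turns of a session to estimate a local response manifold and then hands the rest of the conversation to a small language model acting as a surrogate, improving serving efficiency while preserving response quality. The method combines soft prompt learning, anti-degeneration control, and localized LoRA fine-tuning so that the surrogate runs without prompts at inference time, and it is supported by theoretical analysis and experiments.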

Abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.

2605.11316 2026-05-13 cs.LG math.OC

Error whitening: Why Gauss-Newton outperforms Newton

Maricela Best McKay, Nathan P. Lawrence, Brian Wetton, R. Bhushan Gopaluni

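Affiliations: University of British Columbia; University of California, Berkeley

AI summary: Taking a function-space view, this paper analyzes why Gauss-Newton outperforms Newton's method in practice. It shows that the Gauss-Newton matrix projects the loss gradient onto the model's tangent space, which removes the error distortion introduced by the parameterization, an effect the authors call "error whitening". As a result, Gauss-Newton optimization tracks the structure of the loss itself more closely and performs better across a range of learning tasks.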

Comments NeurIPS preprint

Abstract

The Gauss-Newton matrix is widely viewed as a positive semidefinite approximation of the Hessian, yet mounting empirical evidence shows that Gauss-Newton descent outperforms Newton's method. We adopt a function space perspective to analyze this phenomenon. We show that the generalized Gauss-Newton (GGN) matrix projects the Newton direction in function space onto the model's tangent space, while a Jacobian-only variant obtained by applying the least squares Gauss-Newton matrix to non-least squares losses projects the function space loss gradient onto this same tangent space. Both projections eliminate distortions from the model's parameterization. Specifically, the evolution of the prediction-target mismatch depends on the model's parameterization through the matrix $JJ^\top$ where $J$ is the Jacobian of the model with respect to its parameters. The projections effectively replace $JJ^\top$ with the identity. We call this effect error whitening. Once the parameterization is removed, the prediction-target mismatch evolves according to dynamics dictated by the structure of the loss and the projection produced by the optimizer. Error whitening is a special property of Gauss-Newton descent that rigorously distinguishes it from Newton's method. We empirically demonstrate that Gauss-Newton optimizers follow the theoretically predicted function space dynamics and outperform Newton's method, Adam, and Muon across case studies spanning supervised learning, physics-informed deep learning, and approximate dynamic programming.
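
The replacement of $JJ^\top$ by a projection can be seen in a short continuous-time derivation for the Jacobian-only variant. The gradient-flow setting and the symbols $f$, $L$, $J^{+}$, and $P$ are assumptions made for this sketch; the paper's own derivation may be organized differently.

```latex
% Assumed notation: f(\theta) stacks the model predictions, L(f) is the loss,
% J = \partial f / \partial \theta is the Jacobian, J^{+} its pseudoinverse.
% Gradient flow in parameter space:
\dot{\theta} = -J^\top \nabla_f L
\;\Longrightarrow\;
\dot{f} = J\dot{\theta} = -\,JJ^\top \nabla_f L .
% Jacobian-only Gauss-Newton flow:
\dot{\theta} = -(J^\top J)^{+} J^\top \nabla_f L = -J^{+}\nabla_f L
\;\Longrightarrow\;
\dot{f} = -\,JJ^{+}\nabla_f L = -\,P\,\nabla_f L ,
\qquad P = JJ^{+} .
```

Here $P$ is the orthogonal projector onto the range of $J$, so it acts as the identity on the tangent space: the parameterization-dependent factor $JJ^\top$ no longer shapes how the predictions evolve.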

2605.11312 2026-05-13 cs.AI

Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

Danilo Brajovic, David A. Kreplin, Marco F. Huber

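Affiliations: Fraunhofer IPA; Institute of Industrial Manufacturing and Engineering IFF; University of Stuttgart; Hochschule Heilbronn

AI summary: This paper studies how to prune data effectively when little data is available and proposes Constraint-Data-Value-Maximization (CDVM), a method based on data attribution. By casting pruning as a constrained optimization problem that maximizes total data influence while limiting the contribution to any single test sample, CDVM maintains model performance even when only a small amount of data is retained. Experiments on the OpenDataVal benchmark show strong performance and competitive runtime.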

Comments Accepted for publication at IJCAI 2026

Abstract

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing the performance of a model trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.

2605.11311 2026-05-13 cs.LG cs.CV stat.CO stat.ML

Couple to Control: Joint Initial Noise Design in Diffusion Models

Jing Jia, Liyue Shen, Guanyang Wang

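Affiliations: Department of Computer Science; Rutgers University; Department of EECS; University of Michigan; Department of Statistics

AI summary: This paper studies initial noise design in diffusion models and argues that the conventional assumption of mutually independent initial noises may limit generation. The authors propose designing the dependence structure among the noises while keeping each one marginally standard Gaussian, so that the diversity and quality of multi-sample generation improve without changing the model's input distribution. Experiments show that the method increases generation diversity across several mainstream diffusion models while preserving image quality and prompt alignment, and that it outperforms existing optimization-based methods on some metrics.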

Comments 26 pages

Abstract

Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
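
A coupling that keeps every marginal standard Gaussian while making the samples repel each other can be sketched in a few lines. The construction below (centering i.i.d. draws and rescaling) is one simple exchangeable choice picked for illustration; it is not claimed to be the paper's repulsive Gaussian coupling, and the function name is hypothetical.

```python
import math
import random

def repulsive_gaussian_batch(n, dim, seed=0):
    """Draw n noise vectors that are each N(0, I) but negatively correlated.

    Construction (an illustrative choice): sample i.i.d. Gaussians, subtract
    the batch mean, and rescale by 1/sqrt(1 - 1/n) to restore unit variance.
    Every pair then has correlation -1/(n - 1) in each coordinate.
    """
    if n < 2:
        raise ValueError("need at least two samples to couple")
    rng = random.Random(seed)
    iid = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    scale = 1.0 / math.sqrt(1.0 - 1.0 / n)
    batch_mean = [sum(z[d] for z in iid) / n for d in range(dim)]
    return [[scale * (z[d] - batch_mean[d]) for d in range(dim)] for z in iid]

noises = repulsive_gaussian_batch(n=4, dim=3)
# The coupled noises cancel coordinate-wise, unlike independent draws.
print([abs(sum(z[d] for z in noises)) < 1e-9 for d in range(3)])
```

For `n=2` this reduces to the antithetic pair `z, -z`. Because each vector is a linear function of Gaussians with unit variance per coordinate, a pretrained model sees the same single-sample input distribution as with independent seeds.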

2605.11307 2026-05-13 cs.CV cs.LG

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

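Affiliations: Duke University

AI summary: Vision2Code is a benchmark framework for evaluating multi-domain image-to-code generation, testing whether vision-language models can turn the structure of an image into executable code. It contains 2,169 test examples from 15 datasets covering charts, geometric figures, scientific imagery, and other domains, and it scores outputs with a vision-language-model rater, separating code execution failures from reconstruction quality problems. Experiments show that model performance varies markedly across domains and that filtered model outputs used as training data can improve generation performance.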

Comments Project page: https://image2code.github.io/vision2code/

Abstract

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image well enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.

2605.11304 2026-05-13 cs.CV

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, Curtis Langlotz

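Affiliations: Stanford University; University of Oxford; University of California, Berkeley; HOPPR; University Hospital Zurich

AI summary: CheXTemporal is a dataset for temporal reasoning over chest X-rays, built to address the weakness of current models in handling longitudinal change in chest imaging. It contains paired prior and current chest X-rays with fine-grained temporal and spatial annotations and supports a five-class disease progression taxonomy. The study also builds a weakly supervised dataset of 280K image pairs for evaluating models on temporal reasoning and progression classification; results show that existing models still have clear limitations in temporal reasoning and spatial grounding.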

Abstract

Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.

2605.11303 2026-05-13 cs.CL

Predicting Psychological Well-Being from Spontaneous Speech using LLMs

Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz

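Affiliations: University of Edinburgh; Centre for Medical Informatics (CMI), Usher Institute, University of Edinburgh

AI summary: This study examines the feasibility of using large language models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using voice recordings from 111 participants in the PsyVoiD database, it evaluates 12 instruction-tuned LLMs, including Llama-3, Mistral, Gemma, and Phi-4, with a domain-informed prompt designed together with experts in clinical psychology and linguistics. Results show that LLMs can extract semantic information from speech, reaching Spearman correlations of up to 0.8, and that statistical analyses and keyword-based word cloud analyses improve the interpretability of the predictions.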

Abstract

We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.
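
The Spearman correlation reported above is the Pearson correlation of the rank vectors, so it rewards predictions that order participants correctly even when the raw scores are off. A minimal standard-library sketch follows; the score values are made up for illustration and do not come from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

predicted = [3.1, 4.0, 2.2, 5.0, 4.4]   # hypothetical model scores
reference = [3.0, 4.7, 2.0, 4.8, 4.6]   # hypothetical questionnaire scores
print(spearman(predicted, reference))   # about 0.9
```

In practice a library routine such as SciPy's `spearmanr` would normally be used; the hand-written version is shown only to make the definition explicit.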

2605.11301 2026-05-13 cs.AI cs.CL cs.CV

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

Xueqi Cheng, Yushun Dong

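Affiliations: Department of Computer Science

AI summary: This paper proposes LatentRouter, a multimodal model routing method that selects the most suitable multimodal large language model based on the characteristics of the image-question input. The method builds multimodal routing capsules and model capability tokens, uses communication between these latent states to predict how each candidate model would perform, and improves prediction accuracy with a distributional outcome head and a bounded capsule correction. Experiments show that LatentRouter outperforms existing methods on several benchmarks, especially on tasks that require visual, layout, or reasoning abilities.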

Abstract

Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
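
The final decision step, a utility-based policy with availability masking, can be sketched independently of how the quality predictions are produced. The linear quality-minus-cost utility, the function name `route`, and all numbers below are assumptions for illustration; the learned router that predicts per-model quality is not reproduced.

```python
def route(predicted_quality, cost, available, cost_weight=0.0):
    """Pick the available model with the highest utility.

    utility = predicted quality - cost_weight * cost
    (a simple linear trade-off, assumed here for illustration).
    """
    best_name, best_utility = None, float("-inf")
    for name, quality in predicted_quality.items():
        if not available.get(name, False):
            continue  # availability masking: skip models not in the pool
        utility = quality - cost_weight * cost[name]
        if utility > best_utility:
            best_name, best_utility = name, utility
    return best_name

# Hypothetical per-query quality predictions and per-call costs.
quality = {"large": 0.90, "medium": 0.80, "small": 0.60}
cost = {"large": 1.0, "medium": 0.3, "small": 0.1}
pool = {"large": True, "medium": True, "small": True}

print(route(quality, cost, pool))                       # large
print(route(quality, cost, pool, cost_weight=0.5))      # medium
print(route(quality, cost, {**pool, "large": False}))   # medium
```

Because each model is scored on its own and masked afterwards, adding or removing a candidate changes only the mask, not the scoring function.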

2605.11300 2026-05-13 cs.CV

Can Graphs Help Vision SSMs See Better?

Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna

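Affiliations: USC; UT Austin; DEVCOM ARL Army Research Office, USA

AI summary: This paper studies how graph structure can improve vision state space models (Vision SSMs) and proposes GraphScan, a graph-based dynamic scanning operator. For each visual token, the method builds a local graph, learns feature-based affinities, and produces the output token through one step of message passing over its semantic neighborhood, so that tokens are locally and semantically aligned before global state-space mixing. Experiments show that GraphScan-Mamba, which integrates GraphScan, achieves state-of-the-art performance on several vision tasks with modest computational overhead, offering a semantics-oriented view of scanning for future vision state space models.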

Comments Technical Report

Abstract

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.

2605.11296 2026-05-13 cs.RO cs.SY eess.SY

Computational Design of a Low-Visibility UAV Using a Human-Aligned Perceptual Metric

Jingxian Wang, Chen Yu, David Matthews, Emma Alexander, Sam Kriegman, Michael Rubenstein

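Affiliations: Northwestern University

AI summary: This paper presents Phantom Twist, a single-propeller UAV design that achieves low visibility through high-speed spinning and motion blur. The authors build a two-stage automated design pipeline that optimizes the placement of functional components while meeting flight stability requirements, using a human-aligned perceptual metric (LPIPS) as the optimization objective. Experimental validation shows that the resulting UAVs are stable and controllable and are significantly less visually perceptible than conventional quadcopters.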

Comments Accepted by RSS 2026

Abstract

We introduce Phantom Twist, a type of single-propeller UAV designed to achieve low visibility through high-speed spinning and the exploitation of motion blur. We develop a two-stage automated design pipeline that optimizes the placement of functional components including batteries, control PCB, motor-propeller assembly, and counterweights. The pipeline minimizes visibility as measured by a human-aligned perceptual metric (LPIPS) while strictly satisfying inertial and aerodynamic constraints required for stable flight. We validate this approach through fabrication and flight testing of multiple prototypes. These tests confirm that our pipeline produces stable, controllable designs and that the optimized UAV exhibits significantly reduced visual perceptibility compared to conventional quadcopters.

2605.11291 2026-05-13 cs.LG

Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

Thuan Nguyen, Shuchin Aeron, D. Richard Brown, Prakash Ishwar

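Affiliations: Department of Engineering, Engineering Technology; Department of Electrical and Computer Engineering

AI summary: This paper studies the geometry of optimal representations in contrastive learning (CL) on class-imbalanced datasets. The authors prove that under class imbalance the optimal representations of all samples from the same class collapse to the class mean and exhibit an angular symmetry structure determined by the class proportions. When the imbalance exceeds a threshold, "minority collapse" occurs: all minority-class samples collapse into a single vector. The paper also formulates a convex optimization problem that determines the geometry of the optimal representations and validates the theory with numerical experiments.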

Comments 28 pages, 2 figures

Abstract

In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.
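
The balanced-class baseline mentioned above, class means forming an Equiangular Tight Frame, has a standard explicit construction: the simplex ETF, whose unit vectors all meet at inner product $-1/(K-1)$. The sketch below builds it and checks both properties; the function names are illustrative, and the imbalanced geometry characterized in the paper is not reproduced.

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex equiangular tight frame."""
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

means = simplex_etf(4)
print(round(dot(means[0], means[0]), 6))  # 1.0 (unit norm)
print(round(dot(means[0], means[1]), 6))  # -0.333333, i.e. -1/(K-1)
```

Minority collapse corresponds to the opposite extreme, where several of these directions merge into one vector instead of staying equally separated.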

2605.11290 2026-05-13 cs.CL cs.AI

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong

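Affiliations: Florida State University; Louisiana State University; Vanderbilt University

AI summary: This paper proposes ReAD, a reinforcement-guided capability distillation framework that compresses large language models more effectively under a fixed token budget while preserving the capabilities that matter for downstream tasks. The method identifies task-essential capabilities, generates targeted supervision on the fly, and uses an uncertainty-aware contextual bandit to optimize budget allocation, improving task performance while reducing negative interference between capabilities and wasted resources. Experiments show that ReAD outperforms existing methods under the same budget and is more practical and efficient.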

Abstract

Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.

2605.11289 2026-05-13 cs.LG math.OC

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

Ege C. Kaya, Aliasghar Pourghani, Vijay Gupta, Abolfazl Hashemi

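Affiliations: Elmore Family School of Electrical and Computer Engineering

AI summary: This paper studies distributional reinforcement learning in the average-reward setting. Because direct distributional formulations are ill-posed on the real line, it proposes a representation based on a quotient space and a categorical parameterization to handle the translation invariance of state-indexed bias laws. The paper defines a projected average-reward distributional operator and proves that it is well-defined, non-expansive, and admits fixed points. It also analyzes the convergence of sampled recursions and, for the unknown-gain case, adds an online estimator while preserving stability and convergence.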

Comments 29 pages, 4 figures

Abstract

Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.
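
The statement that the bias is defined only up to an additive constant follows directly from the average-reward evaluation equation for a fixed policy. The symbols below ($g$, $h$, $r$, $P$, $c$) are standard but assumed notation for this sketch, not copied from the paper.

```latex
% Average-reward evaluation (Poisson) equation for a fixed policy,
% with gain g, bias h, reward r, and transition kernel P:
g + h(s) = r(s) + \sum_{s'} P(s' \mid s)\, h(s') \quad \text{for all } s.
% Shifting the bias by any constant c leaves the equation satisfied,
% because \sum_{s'} P(s' \mid s) = 1:
g + \bigl(h(s) + c\bigr) = r(s) + \sum_{s'} P(s' \mid s)\,\bigl(h(s') + c\bigr).
```

A distribution over bias values at each state inherits the same ambiguity, which is why the formulation identifies state-indexed laws that differ by a common translation.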

2605.11276 2026-05-13 cs.CV cs.AI

Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences

Trevor Neece, Mason Smetana, Lev Khazanovich

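Affiliations * University of Pittsburgh

AI summary This study proposes a generative-AI method that produces synthetic images and temporal sequences of highway construction hazard scenarios from OSHA Severe Injury Reports to support safety training. Two generation modes were developed, single-image generation and four-stage temporal-sequence generation, and the generated images were evaluated for educational value, fidelity, and alignment using CLIP-based semantic retrieval and expert assessment. The method provides visual training material without photographing real accident scenes and offers a new evaluation framework for cross-domain synthetic image generation.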

Abstract

Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.

2605.11272 2026-05-13 cs.LG cs.AI cs.IR

Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank

Suryaa Veerabathiran Seran, Ashwin Naresh Kumar, Tracy Holloway King, Jing Zheng

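Affiliations * Adobe

AI summary This paper studies how to mitigate cross-locale behavioral bias in learning-to-rank (LTR) models during international expansion. The authors point out that models trained only on click data neglect semantic localization features, which skews content exposure in non-US locales. They propose a multi-objective framework that combines behavioral feedback, vision-language-model relevance signals, and locale-aware boosting, improving both relevance and local content visibility across multiple locales.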

Abstract

Adobe Express is expanding internationally, but the US has a disproportionately large content supply and interaction volume. Learning-to-rank (LTR) models trained primarily on behavioral feedback inherit this imbalance: templates popular in the US are over-served in non-US locales. This cross-locale exposure bias suppresses local content discoverability and degrades ranking quality in growth locales. We show that click-only training suppresses semantically informative localization features. Adding vision-language model (VLM) graded relevance labels as auxiliary supervision alongside clicks improves semantic alignment but does not preserve local content visibility. We propose a multi-objective framework combining behavioral supervision, VLM-derived relevance signals, and locale-aware boosting. Across five locales, the resulting model improves relevance while restoring stable localization, demonstrating the importance of disentangling exposure from semantic supervision.

2605.11267 2026-05-13 cs.CV

Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

Quanyun Wu, Kyle Gao, Wentao Sun, Hongjie He, Yuhao Chen, David A. Clausi, Jonathan Li

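Affiliations * East China Normal University

AI summary This paper proposes a geometrically consistent, real-scale framework for measuring island area and coastline length from monocular vision. Given only the geographic coordinates or name of the target area, it automatically acquires a low-altitude surrounding image sequence, restores the global physical scale with a lightweight trajectory alignment algorithm, and extracts high-precision 2D planar area and perimeter. The method does not rely on traditional GIS data and substantially reduces mapping cost. Experiments show a measurement error stable at around 10%, with good accuracy, robustness, and inference efficiency, offering a practical approach to large-scale marine and coastline monitoring.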

Comments Accepted for publication at IEEE OCEANS (Sanya) 2026

Abstract

Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline
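The final measurement step can be pictured with a small sketch. It assumes the coastline has already been extracted as a closed polygon in model units and that a single scale factor (here the hypothetical `metres_per_unit`) has been recovered by the alignment step. The paper extracts area and perimeter on a rasterized plane, so the shoelace formula below is a stand-in for illustration, not the authors' procedure.

```python
import math


def polygon_area_and_perimeter(vertices, metres_per_unit=1.0):
    """Return (area in m^2, perimeter in m) of a simple closed polygon.

    vertices: list of (x, y) pairs in model units, ordered along the
    coastline; the last vertex is implicitly joined to the first.
    metres_per_unit: assumed global scale factor (hypothetical input).
    """
    n = len(vertices)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")

    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Shoelace term: signed area of the triangle (origin, p_i, p_i+1).
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)

    area = abs(twice_area) / 2.0
    # Area scales with the square of the length scale, perimeter linearly.
    return area * metres_per_unit ** 2, perimeter * metres_per_unit
```

Because area scales quadratically with the length scale, a relative error in the recovered scale shows up roughly doubled in the area estimate, while the perimeter inherits it once.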

2605.11266 2026-05-13 cs.CV cs.GR cs.LG

PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

Zachary Lee, Maxwell Jacobson, Yexiang Xue

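Affiliations * Department of Computer Science, Purdue University

AI summary This work proposes PG-3DGS, a physics-guided 3D Gaussian Splatting method that aims to generate 3D structures that are not only visually realistic but also physically functional. By coupling differentiable physics simulation with a 3D Gaussian representation, the method optimizes shape under both visual losses and physics objectives, producing functional objects such as teapots that can pour and airplanes that can generate lift. Experiments show that PG-3DGS improves physical functionality while preserving visual quality, and physical lift tests with 3D-printed aircraft support the advantage of the generated structures.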

Comments Submitted to Artificial Intelligence. 52 pages

Abstract

Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS-generated structures than for an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.

2605.11265 2026-05-13 cs.CV cs.AI cs.LG

DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction

Guiqiu Liao, Matjaž Jogan, Daniel A. Hashimoto

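Affiliations * GRASP Laboratory, University of Pennsylvania; PCASO Laboratory, Department of Surgery, University of Pennsylvania; Department of Computer and Information Science, University of Pennsylvania

AI summary This paper proposes DenseTRF, a self-supervised representation adaptation framework that addresses the performance drop caused by distribution shift when dense prediction tasks in surgical scenes (such as segmentation and surgical zone prediction) are deployed across domains. Built on texture-aware attention, the method learns representations that capture invariant visual structures and adapts them to the target distribution without supervision, significantly improving robustness to domain shift. Experiments across multiple surgical procedures show that DenseTRF outperforms state-of-the-art segmentation models and cross-domain adaptation methods.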

Comments Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)

Abstract

Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.

2605.11260 2026-05-13 cs.LG cs.AI

Curriculum Learning-Guided Progressive Distillation in Large Language Models

Jincheng Cao, Fanzhi Zeng, Leqi Liu, Aryan Mokhtari

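Affiliations * The University of Texas at Austin; Google Research

AI summary Knowledge distillation is an important technique for transferring the capabilities of large language models to small student models, but existing methods often ignore the learning order of training data and the capacity mismatch between teacher and student. This paper proposes Curriculum Learning-Guided Progressive Distillation (CLPD), which aligns data difficulty with teacher strength and builds both an explicit and an implicit curriculum to improve distillation. Experiments show that CLPD outperforms standard distillation and single-factor strategies on multiple reasoning benchmarks, underscoring the importance of jointly considering data ordering and teacher capacity.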

Abstract

Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.
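The pairing of an easy-to-hard data curriculum with teachers of increasing capacity can be sketched as a simple schedule builder. The abstract does not say how stages are partitioned, so the equal-sized split, the `difficulty` scoring function, and the `(name, capacity)` teacher tuples below are all assumptions made for illustration.

```python
def build_schedule(examples, difficulty, teachers):
    """Pair easy-to-hard data stages with teachers of increasing capacity.

    examples:   list of training examples.
    difficulty: callable mapping an example to a sortable score
                (hypothetical; lower means easier).
    teachers:   list of (name, capacity) tuples (hypothetical).
    Returns a list of (teacher_name, stage_examples), easiest stage first.
    """
    # Explicit curriculum: order the data from easy to hard.
    ordered = sorted(examples, key=difficulty)
    # Implicit curriculum: order the teachers from weakest to strongest.
    ranked = sorted(teachers, key=lambda teacher: teacher[1])

    stages = len(ranked)
    schedule = []
    for stage, (name, _capacity) in enumerate(ranked):
        # Equal-sized contiguous slices of the ordered data (assumption).
        start = stage * len(ordered) // stages
        end = (stage + 1) * len(ordered) // stages
        schedule.append((name, ordered[start:end]))
    return schedule
```

A training loop would then iterate over the schedule in order, distilling from each stage's teacher on that stage's examples.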

2605.11259 2026-05-13 cs.AI

Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

Grama Chethan

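Affiliations * Siemens Digital Industries Software

AI summary This paper proposes "Template-as-Ontology", a configurable synthetic data infrastructure for validating AI systems in cross-domain manufacturing environments. A single Python configuration module defines both the structure of the manufacturing simulator and the runtime data schema of the AI analytics tools, which keeps the data structures consistent. Experiments show that the framework generates high-quality, MES-shaped synthetic data and effectively reduces tool-parameter errors made by AI tools, providing a reusable data foundation for validating discrete manufacturing AI.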

Comments 18 pages, 1 figure

Abstract

Large language model (LLM)-based AI agents deployed in manufacturing environments require populated, schema-correct data for validation, yet production MES data is proprietary, privacy-encumbered, and vendor-specific. This paper introduces the Template-as-Ontology principle: a single Python configuration module (700-770 lines, 45 validated exports) serves simultaneously as the specification for a time-stepped manufacturing simulator and as the runtime domain schema for AI analytics tools, producing alignment by construction rather than integration. We formally define the domain template as a typed relational configuration schema and prove that structural alignment between simulation and tool layers is guaranteed by single-source consumption. A five-layer pipeline (simulation, PostgreSQL, CDC/Iceberg lakehouse, star schema, and 12 parameterized AI tools) generates causally coherent, MES-shaped data spanning 66 entity types across four operational domains mapped to ISA-95/IEC 62264. We validate the architecture with six industry templates (aerospace, pharma, automotive, electronics, beverages, warehousing) running on identical framework code. Calibration experiments (60 runs, 10 seeds per template) confirm parametric controllability: observed KPIs fall within configured ranges across all templates. A controlled hallucination experiment (72 tool invocations, Qwen3-32B) demonstrates that ontology-constrained parameters eliminate tool-parameter fabrication (0% constrained vs. 43% unconstrained hallucination rate for the evaluated model, Fisher's exact test p < 10^-12); the 0% constrained rate is an architectural guarantee that holds for any model. The framework provides a reusable data layer for discrete manufacturing AI validation.

2605.11258 2026-05-13 cs.AI cs.CL q-bio.QM

Unlocking LLM Creativity in Science through Analogical Reasoning

Andrew Shen, Shaul Druckmann, James Zou

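Affiliations * Stanford University

AI summary This paper studies how analogical reasoning (AR) can improve the creativity of large language models (LLMs) on scientific problems, particularly in complex fields such as biomedicine. The authors find that existing LLMs are prone to mode collapse in open-ended problem solving, producing low-diversity solutions, and propose AR, which generates novel solutions from analogies to cross-domain problems that share relational structure. Experiments show that AR significantly increases the diversity and novelty of generated solutions and outperforms existing methods on several biomedical tasks, supporting its real-world feasibility.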

Abstract

Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($\rho = 0.729$) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.

2605.11255 2026-05-13 cs.CL

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

Noam Kayzer, Dan Revital, Ori Bar Joseph, Smadar Arvatz, Or Levi, Tal Geva, Shaltiel Shmidman, Amir DN Cohen, Noam Ordan, Omer Baruch, Kate Zinkovskaia, Zevi Apini, Sarel Weinberger

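Affiliations * PwC

AI summary This paper presents Hebatron, a Hebrew-specialized open-weight large language model based on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. The model is trained with a three-phase easy-to-hard curriculum and continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. Hebatron reaches a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking, while maintaining high inference throughput and long-context support. It is the first language-specific adaptation of the Nemotron-3 architecture and the first open-weight Hebrew-specialized Mixture-of-Experts model with native long-context support.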

Abstract

We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.

2605.11247 2026-05-13 cs.LG

A Proof-of-Concept Simulation-Driven Digital Twin Framework for Decision-Aware Diabetes Modeling

Zarrin Monirzadeh

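Affiliations * Software & Data Engineer | ML & AI Systems

AI summary This paper proposes a simulation-driven digital twin framework for decision-aware diabetes modeling, illustrated with benchmark clinical data, synthetic temporal augmentation, and continuous glucose monitoring (CGM) analysis. Unlike traditional predictive models, the framework focuses on generating interpretable simulated trajectories rather than clinically validated outcomes. It is evaluated on a public dataset combined with controlled synthetic scenarios, illustrating the feasibility of combining prediction with counterfactual simulation for decision analysis. The work lays a foundation for future research on simulation-driven digital twin systems in healthcare.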

Comments Preprint. 9 figures. DOI: 10.5281/zenodo.20127363

Abstract

This paper presents a proof-of-concept digital twin framework for simulation-driven diabetes modeling using benchmark clinical data, synthetic temporal augmentation, and illustrative continuous glucose monitoring (CGM) analysis. Unlike traditional predictive models, the framework focuses on generating interpretable simulated trajectories rather than clinically validated outcomes. Evaluation is conducted using a public dataset combined with controlled synthetic scenarios to illustrate temporal behavior and intervention effects. Results illustrate the feasibility of integrating prediction with counterfactual simulation for decision-aware analysis. This work does not claim clinical readiness but provides a foundation for future research on simulation-driven digital twin systems in healthcare.

2605.11242 2026-05-13 cs.CL cs.AI

RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German

Ignacio Sastre, Ignacio Remersaro, Facundo Díaz, Nicolás De Horta, Luis Chiruzzo, Aiala Rosá, Santiago Góngora

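Affiliations * Instituto de Computación, Facultad de Ingeniería, Universidad de la República

AI summary This paper describes the RETUYT-INCO team's participation in the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". In several tracks the team used a method called Meta-prompting, in which an LLM generates a custom prompt from training-set examples and that prompt is then used to score student answers. The team also tried classic machine learning, fine-tuning of open-source LLMs, and other prompting techniques. It placed 6th of 8 in Track 1, 4th of 9 in Track 3, and 4th of 8 in Track 4.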

Comments To be presented at the BEA 2026 workshop, co-located with ACL 2026

Abstract

In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
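The QWK scores reported above refer to quadratic weighted kappa, an agreement measure that penalizes disagreements by the squared distance between labels. A minimal pure-Python sketch of the standard definition follows; it assumes labels are integers in `0..num_labels-1` and is not the shared task's official scorer.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Quadratic weighted kappa between two equal-length label sequences.

    Labels must be integers in range(num_labels), with num_labels >= 2.
    Returns 1.0 for perfect agreement and about 0.0 for chance agreement.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty sequences of equal length")
    if num_labels < 2:
        raise ValueError("need at least two labels")

    # Observed confusion matrix: observed[i][j] counts (a == i, b == j).
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels))
              for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent raters with these marginals.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined: both raters used one label")
    return 1.0 - weighted_observed / weighted_expected
```

For the two-way tracks `num_labels` would be 2 and for the three-way track 3; with two labels the quadratic weighting reduces to ordinary Cohen's kappa.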

2605.11239 2026-05-13 cs.LG stat.ML

Extending Kernel Trick to Influence Functions

Zhenhuan Sun, Shahrokh Valaee

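Affiliations * University of Toronto

AI summary This paper proposes a dual representation of influence functions whose computational complexity scales with dataset size rather than model size, offering a more efficient alternative for influence analysis of large models. The representation applies to linearizable models and requires materializing a matrix whose size grows with the product of the model output dimension and the dataset size. It can estimate changes in parameters, model outputs, and loss when the original influence functions are infeasible to evaluate in parameter space, and is advantageous when model size is much larger than dataset size.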

Abstract

In this paper, we present a dual representation of the influence functions, whose computational complexity scales with dataset size rather than model size. Both analytically and experimentally, we show that this representation can be an efficient alternative to the original influence functions for estimating changes in parameters, model outputs and loss due to data point removal, when model size is large relative to dataset size, or when evaluating the original influence functions in parameter space is infeasible. The dual representation, however, is limited to linearizable models, which are models whose behavior can be approximated by their linearizations throughout training, and requires materializing a matrix, whose size grows with the product of model output dimension and dataset size.
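To see why a dual form can scale with dataset size rather than model size, consider ridge regression as an illustrative special case. This is an assumption chosen for simplicity, not the paper's derivation for general linearizable models. Let $X \in \mathbb{R}^{n \times p}$ be the design matrix, $\ell_i = \tfrac{1}{2}(x_i^\top \theta - y_i)^2$ the per-sample loss with residual $r_i = x_i^\top \theta - y_i$, $\lambda > 0$ the ridge penalty, and $e_i$ the $i$-th standard basis vector in $\mathbb{R}^n$.

```latex
\begin{aligned}
H &= X^\top X + \lambda I_p,
\qquad
\nabla_\theta \ell_i = r_i\, X^\top e_i, \\
H^{-1} \nabla_\theta \ell_i
  &= r_i\, (X^\top X + \lambda I_p)^{-1} X^\top e_i
   = r_i\, X^\top (X X^\top + \lambda I_n)^{-1} e_i .
\end{aligned}
```

The last equality is the push-through identity, which follows from $(X^\top X + \lambda I_p) X^\top = X^\top (X X^\top + \lambda I_n)$. The left-hand form inverts a $p \times p$ matrix in parameter space, while the right-hand form inverts the $n \times n$ regularized Gram matrix, which is cheaper when $p \gg n$.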

2605.11237 2026-05-13 cs.LG

DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift

Yongsen Tan, Zhecheng Sheng, Xiruo Ding, Serguei V. S. Pakhomov, Trevor Cohen

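Affiliations * University of Washington; University of Minnesota

AI summary This paper studies "provenance shift", where the relationship between data source and label changes at deployment, and derives a robust learning objective based on counterfactual invariance and invariant learning. The authors develop DeconDTN-Toolkit to simulate provenance shifts of varying degrees and evaluate the robustness of existing algorithms. They reveal the vulnerability of empirical risk minimization under provenance shift and introduce a new out-of-distribution performance indicator, providing theoretical grounding and practical tools for analyzing and mitigating confounding by provenance.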

Comments Accepted to CHIL 2026

Abstract

Despite the burgeoning body of work on distribution shifts, provenance shift, where the relationship between data source and label changes at deployment, remains poorly understood and under-addressed. In this paper, we establish a formal connection between provenance shift, counterfactual invariance, and invariant learning to derive a learning objective for robustness. We then introduce DeconDTN-Toolkit, a specialized evaluation and remediation suite designed to simulate provenance shifts of varying degrees while maintaining the training protocol and the infrastructure of existing benchmarks. We reveal the vulnerability of Empirical Risk Minimization under provenance shift, introduce a robust out-of-distribution performance indicator, and conduct a comprehensive evaluation on existing algorithms. Our work provides both the theoretical grounding and the practical tools necessary to characterize the problem of confounding by provenance, and implementations of methods to mitigate it.