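arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI-generated summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.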

2606.06977 2026-06-08 cs.RO 新提交

Compliance-Based Sensor Placement for Force Sensing on a Sensorized Prostate Phantom

基于柔顺性的传感器布局方法用于传感化前列腺模体的力感知

Sizhe Tian, Yinoussa Adagolodjo, Jeremie Dequidt

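Affiliations * CRIStAL; DEFROST; Polytech Lille

AI summary: Proposes a compliance-based weighted greedy sensor placement method for force sensing on a Digital Rectal Examination training phantom, raising the mean force reconstructability score in the target region by 22.5% over a global QR-based placement.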

AI中文摘要

本文提出一种基于柔顺性的传感器布局方法，用于为直肠指检训练设计的传感化前列腺模体的力感知。该模体结合了三个内部气动腔室（用作内置压力传感器）和十个表面位移标记。通过在外表面采样位置施加外力生成有限元仿真数据集，并构建将力输入与压力和位移响应关联的柔顺矩阵。基于该矩阵，我们提出一种加权贪心选择策略，最大化局部力可重构性，同时优先考虑临床相关的后部接触区域，并避免将标记直接放置在感兴趣区域内。与全局基于QR的布局策略相比，所提方法将目标区域的平均可重构性得分提高了22.5%。这些结果表明，区域感知的稀疏传感器布局可以在保持有限且实用的传感配置的同时，提高软体机器人医疗模体的力可观测性。

英文摘要

This work presents a compliance-based sensor placement method for force sensing on a sensorized prostate phantom designed for Digital Rectal Examination training. The phantom combines three internal pneumatic chambers, used as intrinsic pressure sensors, with ten surface displacement markers. A finite-element simulation dataset is generated by applying external forces at sampled surface locations, from which a compliance matrix relating force inputs to pressure and displacement responses is constructed. Based on this matrix, we propose a weighted greedy selection strategy that maximizes local force reconstructability while prioritizing the clinically relevant posterior contact region and avoiding marker placement directly within the Region of Interest. Compared with a global QR-based placement strategy, the proposed method increases the mean reconstructability score in the target region by 22.5%. These results suggest that region-aware sparse sensor placement can improve force observability in soft robotic medical phantoms while maintaining a limited and practical sensing configuration.
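
The weighted greedy selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the score function (a weighted sum of square-rooted accumulated sensitivities) is a hypothetical proxy for their reconstructability metric, and the names `weighted_greedy_placement`, `weights`, and `forbidden` are invented for this example. `C[i][j]` stands for the response of candidate sensor `i` to a unit force at surface location `j`.

```python
import math


def weighted_greedy_placement(C, weights, k, forbidden=()):
    """Greedily pick up to k candidate sensors (rows of C).

    C[i][j]   : response of candidate sensor i to a unit force at location j
    weights[j]: importance of force location j (larger inside the target region)
    forbidden : candidate indices that must not be selected (e.g. inside the ROI)
    The score is a hypothetical proxy, not the paper's reconstructability metric.
    """
    n_forces = len(C[0])
    blocked = set(forbidden)
    candidates = [i for i in range(len(C)) if i not in blocked]
    energy = [0.0] * n_forces  # accumulated squared sensitivity per force location
    chosen = []

    def score(e):
        # Square root gives diminishing returns for repeatedly covering one location.
        return sum(w * math.sqrt(x) for w, x in zip(weights, e))

    for _ in range(min(k, len(candidates))):
        base = score(energy)
        best_i, best_gain = None, -1.0
        for i in candidates:
            if i in chosen:
                continue
            trial = [e + C[i][j] ** 2 for j, e in enumerate(energy)]
            gain = score(trial) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)
        energy = [e + C[best_i][j] ** 2 for j, e in enumerate(energy)]
    return chosen
```

Raising `weights[j]` for posterior contact locations biases the selection toward the clinically relevant region, while `forbidden` keeps markers out of the Region of Interest itself.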

2606.06976 2026-06-08 cs.AI 新提交

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

通过不确定性对齐强化学习探索智能体工具调用决策

Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao

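Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute

AI summary: To address error accumulation in agentic tool calling, proposes TRUST, which incorporates uncertainty quantification into reward design as a repulsive force and labels lightweight key-turn annotations for unified post-training on multi-turn trajectories, consistently improving decision quality and agent performance.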

AI中文摘要

基于大语言模型的智能体经常做出次优的工具使用决策，包括不支持的工具调用和幻觉式的直接响应，这可能在多步交互中累积错误。现有方法主要通过推理时校正或基于决策结果和结构化检查表的粗粒度奖励信号来改善这些行为，而忽略了智能体决策的不确定性特征。我们观察到，面向决策的强化学习倾向于削弱正确和错误动作之间的不确定性分离，导致过度自信的错误和较弱的探索信号。因此，我们提出TRUST，将不确定性量化作为排斥力融入奖励设计以维持不确定性分离，并标注轻量级关键轮次注释用于多轮轨迹的统一后训练。在多种工具使用基准上的实验结果表明，TRUST在优化过程中持续提升决策质量和智能体性能，同时保持更可靠的不确定性估计。

英文摘要

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

2606.06975 2026-06-08 cs.SD eess.AS 新提交

MyGardenBird: A Machine-Learning-Ready Bird Sound Dataset for Twelve Common Malaysian Birds

MyGardenBird：针对十二种常见马来西亚鸟类的机器学习就绪鸟声数据集

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

Affiliations * Faculty of Computer Science and Information Technology, Universiti Malaya; Faculty of Electrical Engineering, Universiti Teknologi Malaysia

AI summary: Presents MyGardenBird, a dataset of 7,200 manually validated audio clips of 12 common Malaysian bird species sourced from Xeno-canto; convolutional neural network baselines reach 92–96% test accuracy.

Comments 17 pages, 9 figures

AI中文摘要

来自热带地区的生物声学数据集仍然有限，部分原因是缺乏可重复的工作流程来聚合来自公共档案的录音。我们提出了MyGardenBird，一个精心策划的鸟类发声数据集，代表马来西亚半岛和印度-马来亚地区的十二种常见物种。录音来自Xeno-canto，并通过物种级过滤、手动频谱图分割和质量控制检查进行处理。主要版本包含7,200个人工验证的音频片段（16 kHz，16位PCM单声道WAV），每个物种平衡600个三秒片段（总计6.0小时），来自1,381个不同的录音。元数据包括地理空间坐标、发声类别和信噪比（SNR）值（范围：0.83–59.18 dB；平均值：15.80 dB）。还提供了一个44.1 kHz的补充版本。为了减轻数据泄漏，数据集划分在源录音级别定义。使用卷积神经网络在梅尔频谱图上的基线分类实验达到了92–96%的测试准确率，表明种间可分性强。局限性包括依赖单一标注者进行策展；然而，使用BirdNET进行的验证确认了标签一致性。MyGardenBird在CC BY-NC-SA 4.0许可下于 https://doi.org/10.5281/zenodo.20306877 公开提供。随附完整的预处理代码以支持可重复性和未来扩展。

英文摘要

Bioacoustic datasets from tropical regions remain limited, in part due to the absence of reproducible workflows for aggregating recordings from public archives. We present MyGardenBird, a curated dataset of bird vocalisations representing twelve common species across Peninsular Malaysia and the Indo-Malayan region. Recordings were sourced from Xeno-canto and processed through species-level filtering, manual spectrogram segmentation, and quality control checks. The primary release comprises 7,200 manually validated audio clips (16 kHz, 16-bit PCM mono WAV), balanced at 600 three-second clips per species (6.0 hours total) derived from 1,381 distinct recordings. Metadata includes geospatial coordinates, vocalisation categories, and signal-to-noise ratio (SNR) values (range: 0.83–59.18 dB; mean: 15.80 dB). A supplementary 44.1 kHz version is also provided. To mitigate data leakage, dataset partitions are defined at the source-recording level. Baseline classification experiments using convolutional neural networks on Mel-spectrograms achieved test accuracies of 92–96%, indicating strong interspecies separability. Limitations include reliance on single-annotator curation; however, validation with BirdNET confirmed label consistency. MyGardenBird is openly available at https://doi.org/10.5281/zenodo.20306877 under a CC BY-NC-SA 4.0 licence. Complete preprocessing code accompanies the release to support reproducibility and future expansion.
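
Defining partitions at the source-recording level means that all clips cut from one recording land on the same side of the split. A minimal sketch of that idea, assuming each clip carries the identifier of its source recording; the function name, the `(clip_id, recording_id)` tuple layout, and the 20% test fraction are hypothetical and not taken from the dataset's released code:

```python
import random


def split_by_recording(clips, test_fraction=0.2, seed=0):
    """Split clips so that no source recording appears in both partitions.

    clips: list of (clip_id, recording_id) tuples (hypothetical layout).
    Returns (train_clips, test_clips).
    """
    recordings = sorted({rec for _, rec in clips})
    rng = random.Random(seed)
    rng.shuffle(recordings)
    n_test = max(1, round(len(recordings) * test_fraction))
    test_recordings = set(recordings[:n_test])
    train = [c for c in clips if c[1] not in test_recordings]
    test = [c for c in clips if c[1] in test_recordings]
    return train, test
```

Splitting clip by clip instead would let near-identical segments of one recording appear in both training and test data, which tends to inflate test accuracy.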

2606.06972 2026-06-08 cs.AI 新提交

Accounting for Context: Shaping Moral Credences for Value Alignment

考虑情境：塑造道德信念以实现价值对齐

Jazon Szabo, Sanjay Modgil

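Affiliations * University of Oxford

AI summary: Addressing moral pluralism in value alignment, argues that contextual factors must be taken into account when aggregating moral evaluations, formalises agent decision making under moral uncertainty, and shows that the resulting violation of the weak Pareto principle is a variation of Simpson's paradox.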

AI中文摘要

确保智能体行为与人类道德价值观对齐不可避免地引发一个问题：如何解释社会乃至个体通常采纳的多元道德视角。关于道德不确定性的工作提出了在不同道德理论之间公平且民主地聚合行动评估的机制。然而，本文认为在聚合道德评估时需要考虑情境因素。例如，后果主义视角假设能够准确确定智能体的行动如何改变世界；这一假设在现实世界中往往不成立。因此，我们在考虑这些情境因素的同时，形式化了道德不确定性下的智能体决策。我们由此表明，一个看似常识性的属性——弱帕累托原则——被违反了。我们认为，这个看似的问题实际上是辛普森悖论的一种变体，因此揭示了忽视情境因素影响的聚合机制的局限性。

英文摘要

Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of moral perspectives that societies -- and even individuals -- typically adopt. Work on moral uncertainty proposes mechanisms to fairly and democratically aggregate evaluations of actions across different moral theories. However, this paper argues that one needs to account for contextual factors when aggregating moral evaluations. For example, consequentialist perspectives assume an ability to accurately determine how an agent's actions change the world; an assumption that often does not hold in real world settings. We, therefore, formalise agent decision making under moral uncertainty, while also accounting for these kinds of contextual factors. We thereby show that a seemingly commonsensical property -- the weak Pareto principle -- is violated. We argue that this apparent problem is, in fact, a variation of Simpson's paradox, and hence reveals the limitations of aggregation mechanisms that ignore the impact of contextual factors.
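
For readers unfamiliar with Simpson's paradox, the short example below shows the underlying arithmetic: one option has the better success rate in every context, yet the worse rate once the contexts are pooled. The counts are generic illustrative numbers and the option and context names are hypothetical; this is not the paper's formal model of moral aggregation.

```python
from fractions import Fraction

# (successes, trials) per option and context; illustrative numbers only.
counts = {
    "A": {"context_1": (81, 87), "context_2": (192, 263)},
    "B": {"context_1": (234, 270), "context_2": (55, 80)},
}


def rate_in_context(option, context):
    successes, trials = counts[option][context]
    return Fraction(successes, trials)


def pooled_rate(option):
    # Aggregate while ignoring which context each trial came from.
    successes = sum(s for s, _ in counts[option].values())
    trials = sum(n for _, n in counts[option].values())
    return Fraction(successes, trials)


for context in ("context_1", "context_2"):
    print(context, float(rate_in_context("A", context)), float(rate_in_context("B", context)))
print("pooled", float(pooled_rate("A")), float(pooled_rate("B")))
```

The reversal arises because the two options are spread unevenly across contexts, so an aggregation that ignores context weights the per-context rates differently for each option.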

2606.06967 2026-06-08 cs.LG 新提交

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

GenPO++：基于无雅可比似然比的生成式策略优化

Ke Hu, Shutong Ding, Panxin Tao, Jingya Wang, Ye Shi

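Affiliations * ShanghaiTech University

AI summary: Proposes GenPO++, a framework that uses history states in a high-order reversible ODE solver as auxiliary memory to obtain an exactly invertible policy map, enabling unbiased and efficient likelihood-ratio computation for generative flow policies; it achieves competitive or superior performance on continuous-control tasks.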

AI中文摘要

生成式策略提供表达性强且多模态的动作分布，使其在复杂连续控制任务的强化学习（RL）中具有吸引力。其中，基于流的策略尤其吸引人，因为它们通过确定性传输映射生成动作。然而，将此类生成式策略应用于基于似然的在线学习仍然受到评估已执行动作概率的困难限制。现有的流RL方法要么用近似替代品替换真实的动作密度比，这可能会引入有偏更新，要么通过虚拟动作增广恢复精确似然，这会扩大策略空间并增加计算量。在这项工作中，我们提出GenPO++，一种可逆生成式策略优化框架，它使用高阶可逆ODE求解器中的历史状态作为辅助记忆，在不改变原始动作维度的情况下实现精确反演。由此产生的生成式策略映射的对数行列式仅由固定的求解器系数决定，从而实现了精确且无雅可比的似然比计算。这种设计保留了生成流策略的表达能力，同时避免了动作比率偏差和虚拟动作开销。我们在大规模模拟控制、微调和真实机器人操作任务上评估了GenPO++，与最先进的在线RL方法相比，它取得了具有竞争力或更优的性能，同时提高了训练稳定性和计算效率。

英文摘要

Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because they generate actions through deterministic transport maps. However, applying such generative policies to likelihood-based on-policy learning remains limited by the difficulty of evaluating the probability of executed actions. Existing flow RL methods either replace the true action-density ratio with approximate surrogates, which can introduce biased updates, or recover exact likelihoods through dummy-action augmentation, which enlarges the policy space and increases computation. In this work, we propose GenPO++, a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, yielding exact inversion without changing the original action dimension. The resulting generative policy map has a log-determinant determined only by fixed solver coefficients, enabling exact and Jacobian-free likelihood-ratio computation. This design preserves the expressiveness of generative flow policies while avoiding both action ratio bias and dummy-action overhead. We evaluate GenPO++ on large-scale simulated control, fine-tuning, and real-world robotic manipulation tasks, where it achieves competitive or superior performance over state-of-the-art on-policy RL methods, while improving training stability and computational efficiency.

2606.06966 2026-06-08 cs.CV 新提交

From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards

从视觉到文本：一种用于身份证件跨域鲁棒演示攻击检测的紧凑多模态方法

Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch

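Affiliations * da/sec-Biometrics and Security Research Group, Hochschule Darmstadt; Technical University of Denmark (DTU)

AI summary: Addressing cross-domain transfer in presentation attack detection on ID cards, proposes a compact multimodal model combining visual and textual data; finds that it generalises well after supervised fine-tuning but fails in the zero-shot setting, highlighting the importance of model capacity and real data.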

Comments Publication under the revision process on IEEE

2606.06959 2026-06-08 cs.CL cs.AI 新提交

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

OpenHalDet：面向多种生成场景的幻觉检测统一基准

Xinyi Li, Zhen Fang, Yongxin Deng, Jinyuan Luo, Hongnan Ma, Changdae Oh, Zijing Shi, Shanshan Ye, Hanchen Wang, Shu-Lin Chen, Yadan Luo, Mengyue Yang, Sean Du, Sharon Li, Ling Chen

Affiliations * University of Technology Sydney; University of Wisconsin–Madison; University of Bristol; The University of Queensland; Nanyang Technological University

AI summary: Presents the OpenHalDet benchmark, which standardises the hallucination detection evaluation pipeline and supports black-box, gray-box, and white-box detectors, enabling controlled comparison across tasks, models, and detectors.

Comments Preprint. Code and data are available at https://github.com/Nellie179/Hallucination-Detection

AI中文摘要

幻觉检测对于大型语言模型（LLM）的可靠部署至关重要。然而，现有评估面临两个核心挑战：推理配置和评估不一致，以及下游领域和任务的覆盖有限。因此，报告的检测器性能往往难以比较、复现，并泛化到特定实验设置之外。我们引入OpenHalDet，一个面向多种生成场景的幻觉检测统一基准。OpenHalDet标准化了评估流程，从提示构建和响应生成到真实性标注、检测器评分和指标计算。它支持不同访问设置下的异构检测器家族，包括仅使用生成输出的黑盒方法、依赖基于概率信号的灰盒方法，以及利用内部模型信号的白盒方法。通过将多样化的任务、模型和检测器纳入共享框架，OpenHalDet实现了可控比较，并提供了不同检测范式在LLM应用中行为的系统视角。我们发布OpenHalDet作为开放且可扩展的代码库，以促进幻觉检测方法的可复现评估和未来发展。代码和数据集可在 https://github.com/Nellie179/Hallucination-Detection 获取。

英文摘要

Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

2606.06958 2026-06-08 cs.CV 新提交

MVSegNet: A Lightweight Boundary-Aware Network for Fetal Lateral Ventricle Segmentation and Atrial Width Estimation in Prenatal Ultrasound

MVSegNet：一种用于产前超声中胎儿侧脑室分割和侧脑室房部宽度估计的轻量级边界感知网络

Arafat Hossain Sayem

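Affiliations * Department of Computer Science & Engineering, Stamford University Bangladesh

AI summary: Proposes MVSegNet, a lightweight boundary-aware network combining multi-scale feature extraction and boundary refinement for lateral ventricle segmentation; on 584 ultrasound frames it reaches a Dice score of 80.79% and an atrial width mean absolute error of 3.40 mm, with 2.31 million parameters and 165.6 frames per second on an NVIDIA T4 GPU.

Comments 11 pages, 3 figures, 4 tables. Code and trained models will be released upon acceptance. Supplementary material available upon request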

AI中文摘要

胎儿脑室扩张通过测量产前超声中侧脑室的房部宽度来评估。准确的分割对于这一测量至关重要，但声影、散斑噪声和低对比度使其变得困难。我们开发了MVSegNet，一种轻量级的编码器-解码器网络，结合了多尺度特征提取和边界感知细化。该模型在584张专家标注的经脑室超声图像上使用70/15/15划分进行训练和评估。使用重叠、边界和测量指标与六个分割基线进行了性能比较。MVSegNet实现了80.79%的Dice分数、68.47%的IoU、4.07 mm的豪斯多夫距离和3.40 mm的侧脑室房部宽度平均绝对误差。该模型包含231万个参数，在NVIDIA T4 GPU上以165.6帧/秒的速度运行。MVSegNet在边界和测量指标上优于所有评估的基线，同时保持较低的计算成本，支持其在自动化胎儿超声分析中的应用。

英文摘要

Fetal ventriculomegaly is assessed by measuring the atrial width of the lateral ventricle in prenatal ultrasound. Accurate segmentation is essential for this measurement, but acoustic shadowing, speckle noise, and poor contrast make it difficult. We developed MVSegNet, a lightweight encoder-decoder network combining multi-scale feature extraction and boundary-aware refinement. The model was trained and evaluated on 584 expert-annotated transventricular ultrasound frames using a 70/15/15 split. Performance was compared against six segmentation baselines using overlap, boundary, and measurement metrics. MVSegNet achieved a Dice score of 80.79%, IoU of 68.47%, Hausdorff distance of 4.07 mm, and atrial width mean absolute error of 3.40 mm. The model contains 2.31 million parameters and runs at 165.6 frames per second on an NVIDIA T4 GPU. MVSegNet outperformed all evaluated baselines on boundary and measurement metrics while maintaining low computational cost, supporting its use in automated fetal ultrasound analysis.
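
The two overlap metrics quoted above follow standard definitions, sketched below for binary masks given as flat sequences of 0/1 values. The function name and the convention of returning 1.0 for two empty masks are choices made for this example, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks of 0/1 values."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou
```

For a single pair of masks the two values are linked by IoU = Dice / (2 - Dice). Averages over many images need not satisfy this identity exactly, which would be consistent with the reported pair (80.79%, 68.47%) if the metrics are averaged per image; that averaging scheme is an assumption here.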

2606.06953 2026-06-08 cs.RO 新提交

LIMMT: Less is More for Motion Tracking

LIMMT：少即是多的运动追踪

Yu Guan, Zekun Qi, Chenghuai Lin, Xuchuan Chen, Dairu Liu, Wenyao Zhang, Jilong Wang, Xinqiang Yu, He Wang, Li Yi

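Affiliations * Tsinghua University; GalBot; Shanghai Jiao Tong University; Peking University; Shanghai Qi Zhi Institute

AI summary: Proposes LIMMT, a data-driven motion tracking framework that selects high-quality motion data along three dimensions (physical feasibility, diversity, and complexity), surpassing training on the full dataset while using only 3% of AMASS.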

Comments Accepted at ICML 2026

2606.06950 2026-06-08 cs.CV cs.AI 新提交

When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

何时3D值得？肺CT中CNN和Transformer的资源-性能前沿

Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr., Fred Prior

Affiliations * Department of Biomedical Informatics, University of Arkansas for Medical Sciences; Department of Information Science, University of Arkansas at Little Rock; Department of Neuroscience, University of Arkansas for Medical Sciences

AI summary: Studies how 2D, 2.5D, and 3D inputs affect CNNs and Transformers in lung CT; finds that the 2.5D CNN offers the most favorable discrimination-stability trade-off, while 3D CNNs show threshold instability and Transformers exhibit degenerate predictions.

Comments 8 pages, 6 figures

AI中文摘要

三维模型通常被认为更适合体积医学成像，但其实际价值取决于性能提升是否值得增加的计算成本和复杂性。我们不提出新架构，而是研究在固定训练协议下，输入维度（2D、2.5D、3D）如何影响卷积神经网络（CNN）和视觉Transformer（ViT）的行为。使用无泄漏的NLST队列（n=1,977）和辅助LIDC-IDRI数据，我们发现2.5D CNN在我们的比较中提供了最有利的判别-稳定性权衡（ROC-AUC 0.682，95% CI [0.546, 0.799]），具有稳定的操作点。相比之下，3D CNN表现出阈值不稳定性，而Transformer出现退化预测，例如全正预测。置信区间宽且重叠，因此我们将这些结果呈现为受控的资源-性能前沿和失败模式分类，而非明确的优越性声明。对于类别不平衡的肺癌筛查分类，2D和2.5D输入在性能、稳定性和计算效率之间提供了比全3D表示更可靠的权衡。

英文摘要

Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.
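
A common way to build a 2.5D input is to stack a slice with its neighbours as channels, so a 2D network sees some through-plane context. The sketch below follows that convention; the abstract does not spell out the authors' exact 2.5D construction, so the neighbour count, the edge clamping, and the function name are assumptions.

```python
def make_2p5d_input(volume, index, context=1):
    """Return 2*context + 1 slices centred on `index`, to be used as channels.

    volume : sequence of 2D slices ordered along the scan axis
    index  : position of the central slice
    context: number of neighbouring slices on each side (assumed value)
    Indices beyond the volume are clamped, so edge slices are repeated.
    """
    last = len(volume) - 1
    return [volume[min(max(index + offset, 0), last)]
            for offset in range(-context, context + 1)]
```

With `context=0` this reduces to the plain 2D input, and passing the whole volume corresponds to the full 3D case, which is one way to view the three settings as points on a single resource axis.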

2606.06946 2026-06-08 cs.CL cs.AI 新提交

Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

领域自适应大语言模型中的训练数据审计：LoRA-MINT

Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado

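Affiliations * University of Granada

AI summary: Proposes LoRA-MINT, a Membership Inference Test method for auditing the training data of LoRA fine-tuned large language models; across four models and three benchmarks it reaches precision of 0.77 to 0.92, outperforming existing baselines.

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026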

AI中文摘要

我们提出了LoRA-MINT，一种应用于通过低秩适应（LoRA）针对特定自然语言处理（NLP）任务微调的最新大语言模型（LLMs）的成员推理测试（MINT）新方法。主要目标是评估个体样本是否属于这些适应模型的训练数据，为知识产权和敏感数据管理提供有用的审计工具。我们的分析探索了模型困惑度与成员状态之间的关系，提供了一个系统框架来估计微调LLMs中的数据暴露程度。我们在四个模型和三个基准数据集上进行了实验，在确定给定数据是否用于训练时获得的精度值在0.77到0.92之间，优于最先进的基线，并证明了所提出方法的鲁棒性和通用性。总的来说，我们的发现强调了LoRA-MINT作为审计LLMs的有效且可扩展框架的潜力，提高了透明度，并促进了AI和NLP技术的道德和负责任部署。为了具体性和当前相关性，我们的讨论和实验集中在LoRA调整的LLMs上，但请注意，所提出的大部分方法很容易适用于审计任何其他适应LLMs的技术或更一般地任何其他领域自适应AI模型的训练数据。

英文摘要

We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values in determining if given data were used for training ranging from 0.77 to 0.92, which outperform state-of-the-art baselines and demonstrate the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is easily applicable for auditing training data given any other technique for adapting LLMs or, more generally, any other domain-adapted AI models.
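
The link between perplexity and membership can be illustrated with a simple threshold test: a fine-tuned model tends to assign lower perplexity to text it was trained on. The sketch below is a generic baseline of that kind, not the LoRA-MINT procedure itself; the per-token log-probabilities are assumed to come from the audited model, and the function names and the threshold are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities of one sample."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict_member(token_logprobs, threshold):
    """Flag a sample as a training member if its perplexity is below threshold."""
    return perplexity(token_logprobs) < threshold


def precision(predictions, labels):
    """Fraction of samples flagged as members that truly were members."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predictions, labels) if p and not y)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0
```

In practice the threshold would be calibrated on samples with known membership status, and precision, the metric reported in the abstract, is then measured on held-out samples.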

2606.06944 2026-06-08 cs.RO 新提交

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

T-GMP：基于地形条件的生成式运动先验用于多功能且自然的人形机器人运动

Junhong Guo, Hao Hu, Chen Chen, Haoxuan Han, Linao Gong, Xin Yang, Zhicheng He, Yao Su, Fenghua He

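Affiliations * Harbin Institute of Technology; Leju Robotics

AI summary: Proposes the T-GMP module, which uses a conditional variational autoencoder to learn a terrain-conditioned latent motion manifold from a few expert demonstrations and, combined with adversarial learning and a Foothold Penalty, yields versatile, natural locomotion that adapts to terrain variations under a unified policy.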

AI中文摘要

实现拟人自然性和鲁棒地形穿越仍然是人形机器人运动的基本挑战。现有的强化学习方法通常依赖固定的运动先验，限制了其对变化环境的适应性。我们提出基于地形条件的生成式运动先验（T-GMP），该模块使用条件变分自编码器从少量专家状态-地形演示中捕获地形条件潜在运动流形。学习到的先验能够实现平滑的风格转换，促进统一策略适应地形变化。我们将 T-GMP 集成到对抗学习流程中，并引入提出的立足点惩罚，其中判别器根据局部地形特征动态调节自然性约束，指导生成多功能且类人的运动。实验结果表明，我们的方法在穿越成功率和运动平滑度上优于现有基线，同时保持了仿生自然和物理协调的运动。

英文摘要

Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations using a Conditional Variational Autoencoder (CVAE). The learned priors enable smooth style transitions, facilitating a unified policy that adapts to terrain variations. We integrate T-GMP into an adversarial learning pipeline with our proposed Foothold Penalty, where a discriminator dynamically modulates naturalness constraints conditioned on local terrain features, guiding the generation of versatile and human-like motions. Experimental results demonstrate that our method outperforms existing baselines in traversal success rate and motion smoothness, while preserving biomimetically natural and physically coordinated motions.

2606.06943 2026-06-08 cs.CV cs.AI 新提交

SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

SS-TPT：面向对抗鲁棒视觉语言模型的稳定性和适用性引导的测试时提示微调

Sunoh Kim, Daeho Um

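Affiliations * Dankook University, Yongin, South Korea; University of Seoul, Seoul, South Korea

AI summary: Proposes SS-TPT, which assesses the quality of augmented views through stability and suitability scores to guide test-time prompt tuning, substantially improving adversarial robustness while maintaining high throughput.

Comments Accepted in ICML2026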

AI中文摘要

视觉语言模型（如CLIP）实现了强大的零样本识别，但在对抗扰动下仍然非常脆弱。最近的测试时自适应防御通过利用大量增强视图来提高鲁棒性，但这导致了不切实际的减速和明确的鲁棒性-吞吐量权衡。为了应对这一挑战，我们提出了稳定性和适用性引导的测试时提示微调（SS-TPT），通过两个互补分数评估每个增强视图的质量：（1）稳定性，衡量对弱增强的预测不变性，以及（2）适用性，衡量视图间的特征空间密度。这些稳定性和适用性（SS）分数通过SS引导的一致性损失和SS加权预测来指导自适应和推理，放大可信视图同时抑制受损视图。大量实验表明，SS-TPT显著优于先前最先进的方法，在不同数据集和不同视图数量下实现了卓越的鲁棒性-吞吐量权衡，从而展示了强大的实用性和泛化性。我们的代码可在以下网址获得：https://github.com/sunoh-kim/SS-TPT。

英文摘要

Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.

2606.06942 2026-06-08 cs.CL cs.AI 新提交

Didact: A Cross-Domain Capability Discovery System for Defence

Didact：面向国防的跨领域能力发现系统

Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Thomas Marchant, Oscar Leslie, Flora Salim

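Affiliations * University of New South Wales, Sydney, Australia; Cyndr.ai, Australia

AI summary: Presents Didact, a prototype that integrates heterogeneous defence reports and policy documents by building a knowledge graph and a composite retrieval-augmented generation pipeline, supporting natural language conversations and visual evidence tracing to address fragmented cross-domain capability discovery.

Comments Under Review at CIKM 2026 (System Demonstration Track)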

AI中文摘要

国防及国防相关领域的政策制定者必须监控快速发展的研究以及与其作战和战略需求相关的部门优先事项。实际上，这些来源分散在异构格式、不连贯的存储库和孤立的更新流中，使得能力发现缓慢且难以审计。我们提出了Didact，一个原型系统，它将来自澳大利亚的公开国防报告和政策文件与基于澳大利亚研究出版物构建的专用知识图谱相结合。Didact为面向政策的工作流程提供自然语言对话，并利用复合检索增强生成（RAG）管道。Didact的一个关键特性是交互式证据轨道，它可以可视化检索到的证据和源关系。我们对Didact的输出质量和运行时间的评估凸显了其实用性。虽然Didact是作为澳大利亚背景下的学术界-工业界合作项目共同开发的，但它适用于知识同样碎片化的其他领域。演示视频可在此处获取：

英文摘要

Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

2606.06941 2026-06-08 cs.AI 新提交

Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

量子启发的迹增强证据选择用于结构化假设空间推理

Laura Wynter, Nirvik Sahoo, Paul Griffin

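Affiliations * School of Computing and Information Systems, Singapore Management University

AI summary: Proposes EP-HUBO, which casts the selection of CoT reasoning fragments as a combinatorial optimisation problem and aggregates evidence via higher-order binary optimisation, allowing well-supported minority hypotheses to override noisy majorities on evidence-intensive legal reasoning benchmarks.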

AI中文摘要

大型语言模型（LLMs）现在能够在广泛的专业级考试中达到或超过人类水平，但在法律等专门、证据密集型领域仍然脆弱。在这些任务上，错误不仅源于世界知识的空白，还源于证据片段之间的细微差别以及支持证据的不一致使用。最常见的基于采样思维链（CoT）轨迹的聚合器——多数投票，返回最流行的答案，而不考虑其证据是否实际上最强。我们提出将CoT推理片段的选择视为一个显式的组合优化问题，使得有充分支持但属于少数的假设能够覆盖噪声多数，并在对证据质量特别敏感的法律推理基准上评估该方法。我们引入了EP-HUBO（证据池高阶二元优化），它使用小型本地模型生成多个CoT轨迹，将片段解析为每个假设的证据池，对每个池求解具有质量衍生权重（相关性、特异性、区分性）的高阶无约束二元优化，并委托前沿模型对每个问题进行一次裁决调用。我们在两个证据密集型法律基准上评估了EP-HUBO，使用了经典硬件上的模拟退火以及Quantum Computing Inc.的Dirac-3光量子熵量子机。HUBO风格的优化提供了一种原则性的方法来聚合推理片段，同时保留少数但正确的假设，并且在低污染领域（前沿模型尚未吸收基准材料）中最为有价值。

英文摘要

Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.

2606.06938 2026-06-08 cs.CV 新提交

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

当CLIP看得更多，它反击得更猛烈：多视图引导的自适应反击用于测试时对抗鲁棒性

Sunoh Kim, Daeho Um

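Affiliations * Dankook University, Yongin, South Korea; University of Seoul, Seoul, South Korea

AI summary: Proposes Multi-view guided Adaptive Counterattack (MAC), which builds augmented views of the input image, performs counterattacks to refine their embeddings, adaptively scales the counterattack intensity per view, and aggregates the views, substantially improving CLIP's test-time adversarial robustness.

Comments Accepted in CVPR2026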

AI中文摘要

视觉-语言模型如CLIP在零样本识别方面取得了显著成就，但其对对抗扰动的鲁棒性仍然有限。最近提出的测试时反击（TTC）通过在推理过程中扰动输入图像使其远离受损状态来提高CLIP的鲁棒性。然而，TTC在强攻击下仍然脆弱，因为其反击依赖于直接受损的原始视图，并采用噪声驱动的硬门控方案，无法适应变化的损坏严重程度。为了解决这些限制，我们引入了多视图引导的自适应反击（MAC），它针对多视图执行具有损坏感知软加权的反击。具体来说，MAC首先构建输入图像的增强视图以获得多样化的嵌入。然后，它执行反击以精炼视图的受损嵌入。接下来，MAC根据每个视图的估计损坏程度自适应地缩放反击强度。最后，经过自适应反击的视图被聚合以产生鲁棒的最终预测。在20个数据集和多种攻击场景下的广泛实验表明，MAC显著提高了鲁棒性，同时由于其无调优设计，保持了高推理速度和内存效率。我们的代码可在 https://github.com/sunoh-kim/MAC 获取。

英文摘要

Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks for multi-view with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine corrupted embeddings of views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.

2606.06934 2026-06-08 cs.LG 新提交

Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

定点参数上GD和SGD的均匀稳定性与泛化误差

Jonghyun Shin, Sejun Park

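Affiliations * Department of Artificial Intelligence, Korea University

AI summary: Studies the generalization error and uniform stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces; shows that deterministic rounding degrades the GD generalization error rate from $O(T/n)$ to $O(T/\sqrt{n})$, that SGD with deterministic rounding still admits nontrivial stability guarantees, and that stochastic rounding can introduce generalization error that grows with the dimension.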

AI中文摘要

我们分析了离散参数空间上梯度下降（GD）和随机梯度下降（SGD）的泛化误差、均匀稳定性和均匀参数稳定性，其中每次更新涉及确定性或随机舍入。我们表明，确定性舍入会恶化GD在凸、Lipschitz和平滑损失函数上的泛化误差，使速率从$O(T/n)$变差为$O(T/\sqrt{n})$，并建立了匹配的下界。我们进一步证明GD的均匀稳定性变为$\Omega(T)$，表明基于稳定性的泛化界在此设置中是无效的。相比之下，对于相同的损失，带有确定性舍入的随机梯度下降具有非平凡的均匀稳定性保证，这些保证与实值情况有质的区别，并且在迭代次数和维度上表现出不同的依赖性：我们证明了一维的紧界$O(T/n)$和高维的$O(T^2/n)$。我们还表明，随机舍入可能引入随维度增加的泛化误差；这种现象在标准实值优化和确定性舍入情况下是不存在的。最后，我们给出了随机舍入方案的均匀参数稳定性的上界，并表明当损失可以表示为坐标函数之和时，这些界是紧的。

英文摘要

We analyze generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces, where each update involves deterministic or stochastic rounding. We show that deterministic rounding degrades the generalization error of GD on convex, Lipschitz, and smooth loss functions, increasing the rate from $O(T/n)$ to $O(T/\sqrt{n})$, and establish matching lower bounds. We further prove that uniform stability of GD becomes $\Omega(T)$, showing that stability-based generalization bounds are vacuous in this setting. In contrast, for the same losses, stochastic gradient descent with deterministic rounding admits nontrivial uniform stability guarantees, which differ qualitatively from the real-valued case and exhibit distinct dependencies on the number of iterations and the dimension: we prove tight bounds $O(T/n)$ for one dimension and $O(T^2/n)$ for higher dimensions. We also show that stochastic rounding can introduce generalization error that increases with the dimension; such a phenomenon is absent in standard real-valued optimization and in the deterministic rounding case. Finally, we provide upper bounds on uniform argument stability for stochastic rounding schemes and show that these bounds are tight when the loss can be represented as a sum of coordinate-wise functions.
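
The contrast between deterministic and stochastic rounding described in this abstract can be illustrated with a minimal sketch. It is not the paper's construction: the grid spacing `delta`, the quadratic loss, the step size, and the helper names `round_nearest`, `round_stochastic`, and `quantized_gd` are all assumptions chosen for illustration. The sketch shows one well-known effect of rounding after every update: with round-to-nearest, an update smaller than half the grid spacing is rounded away and GD stalls, whereas stochastic rounding is unbiased in expectation.

```python
import numpy as np


def round_nearest(x, delta):
    """Deterministic rounding to the nearest multiple of delta."""
    return delta * np.round(x / delta)


def round_stochastic(x, delta, rng):
    """Stochastic rounding: round up with probability equal to the
    fractional part, so the rounded value equals x in expectation."""
    scaled = x / delta
    low = np.floor(scaled)
    round_up = rng.random(np.shape(x)) < (scaled - low)
    return delta * (low + round_up)


def quantized_gd(w0, grad, lr, steps, rounder):
    """Gradient descent where every iterate is rounded back onto the grid."""
    w = w0
    for _ in range(steps):
        w = rounder(w - lr * grad(w))
    return w


# Hypothetical setup: loss 0.5 * (w - 0.1)^2 on a grid with spacing 0.25.
delta = 0.25
grad = lambda w: w - 0.1
rng = np.random.default_rng(0)

w_det = quantized_gd(np.array([1.0]), grad, lr=0.1, steps=200,
                     rounder=lambda x: round_nearest(x, delta))
w_sto = quantized_gd(np.array([1.0]), grad, lr=0.1, steps=200,
                     rounder=lambda x: round_stochastic(x, delta, rng))

# Empirical check that stochastic rounding is unbiased at a single point.
samples = round_stochastic(np.full(100_000, 0.91), delta, rng)
print(w_det, w_sto, samples.mean())
```

In this toy setting the deterministic iterate never leaves its starting grid point, because each step of size 0.09 is below `delta / 2`. The stochastic variant moves on average but injects noise at every step, which is related to the trade-offs the abstract analyzes.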


2606.06928 2026-06-08 cs.SD eess.AS 新提交

VoxCPM2 Technical Report

VoxCPM2 技术报告

Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Jiancheng Gui, Jiaheng Wu, Ziyang Wang, Xudong Shen, Runchuan Ye, Zhisheng Zhang, Jiuyang Zhou, Bingsong Bai, Weiyue Sun, Mengyuan Deng, Qundong Shi, Zhiyong Wu, Zhiyuan Liu

发表机构 * VoxCPM Team（VoxCPM 团队）

AI总结提出VoxCPM2，一种全开源多语言可控语音生成基础模型，通过层次化扩散自回归建模、非对称AudioVAE和2B参数/200万小时数据扩展，在零样本和指令跟随TTS基准上达到SOTA，平均WER为1.68%。

Comments The technical report of VoxCPM2, a TTS foundation model (GitHub: https://github.com/OpenBMB/VoxCPM)


AI中文摘要

我们提出VoxCPM2，一个完全开源的多语言可控语音生成基础模型，它扩展了VoxCPM的层次化扩散自回归建模范式。VoxCPM2在三个关键维度上推进了该框架：(i) 能力，通过统一30种语言、9种中文方言、自然语言语音设计、风格可控的语音克隆以及高保真延续克隆于单个骨干网络；(ii) 质量，通过非对称AudioVAE以16 kHz编码并以48 kHz重建，实现具有高编码效率的隐式超分辨率；(iii) 规模，通过将模型联合扩展到2B参数，训练数据超过200万小时的多语言语音。为了在单个模型中支持这些多样化的能力，我们引入了一种统一的序列组织方式，通过相同输入构建块的不同排列来表达所有生成模式，从而允许在单一参数集和目标下进行联合训练。VoxCPM2在公共零样本和指令跟随TTS基准上达到了最先进或具有竞争力的性能。在我们的内部30语言评估集上，它取得了平均1.68%的词错误率。这些结果表明，层次化连续潜在建模无需依赖任何外部离散语音分词器，为大规模多语言可控语音生成提供了可行且强大的基础。模型权重、微调代码和推理工具已在Apache 2.0许可下公开发布，以促进社区研究和开发。

英文摘要

We present VoxCPM2, a fully open-source multilingual and controllable speech generation foundation model that extends the hierarchical diffusion-autoregressive modeling paradigm of VoxCPM. VoxCPM2 advances the framework in three key dimensions: (i) capability, by unifying 30 languages, 9 Chinese dialects, natural-language voice design, style-controllable voice cloning, and high-fidelity continuation cloning within a single backbone; (ii) quality, through an asymmetric AudioVAE that encodes at 16 kHz and reconstructs at 48 kHz, enabling implicit super-resolution with high encoding efficiency; and (iii) scale, by jointly scaling the model to 2B parameters and the training data to over 2 million hours of multilingual speech. To support these diverse capabilities within one model, we introduce a unified sequence organization that expresses all generation modes through different arrangements of the same input building blocks, allowing joint training under a single set of parameters and objective. VoxCPM2 achieves state-of-the-art or competitive performance on public zero-shot and instruction-following TTS benchmarks. On our internal 30-language evaluation set, it attains an average WER of 1.68%. These results demonstrate that hierarchical continuous-latent modeling, without relying on any external discrete speech tokenizer, offers a viable and powerful foundation for large-scale multilingual and controllable speech generation. The model weights, fine-tuning code, and inference tools are publicly released under the Apache 2.0 license to foster community research and development.


2606.06926 2026-06-08 cs.CV cs.MM 新提交

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

SVHighlights: 迈向极长体育视频精彩片段检测

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

发表机构 * Ulsan National Institute of Science and Technology（蔚山科学技术院）

AI总结针对现有方法无法处理超长视频精彩片段检测的问题，提出首个基准SVHighlights（包含320个平均时长2小时的体育视频）以及无训练的分段方法TF-SELECTOR，通过大语言模型融合多模态信息预测片段级显著性分数，在多个指标上超越现有基线。

Comments Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/


DOI: 10.1145/3770855.3817564

AI中文摘要

尽管长视频的精彩片段检测具有重要的实际意义，但现有方法大多局限于短视频内容，这主要是由于缺乏合适的基准。为了填补这一空白，我们引入了SVHighlights，据我们所知，这是首个针对极长体育视频（每段时长超过一小时，涵盖多种体育类别）精彩片段检测的基准。SVHighlights是通过一个数据集生成流水线，从完整体育视频及其对应的官方精彩片段视频对构建而成，无需传统的逐片段显著性标注即可实现可扩展的标签生成。该基准包含320个视频，平均时长2.00小时，总时长640.18小时，显著超过以往的数据集。现有方法在长视频上也面临根本性挑战：在短视频片段上训练的模型无法泛化到小时级内容，并且它们的片段级评分缺乏识别精彩片段所需的更广泛上下文。为了解决这一问题并提供一个强基线，我们提出了TF-SELECTOR，一种无需训练的基于分段的方法，该方法通过合并相邻的具有相同语义内容的镜头，将每个视频划分为上下文感知的分段，并使用多模态输入（包括视觉描述、转录文本和音频音量）的大语言模型预测分段级显著性分数。实验表明，与视频时间定位（VTG）微调的基线相比，TF-SELECTOR在大多数指标上取得了更优的性能，在HIT@1上提升+3.12，在HIT@K上提升+4.06，在IoU上提升+2.95。这些结果确立了SVHighlights作为长视频精彩片段检测的具有挑战性的测试平台，并证明了简单的基于分段的策略可以有效地扩展到小时级视频。

英文摘要

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.


2606.06924 2026-06-08 cs.LG 新提交

From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing

从采样结果到能力分布：重新思考LLM路由的监督

Guannan Lai, Haoran Hu, Long Chen, Zhenguo Li, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University（南京大学人工智能学院）； National Key Laboratory for Novel Software Technology, Nanjing University（南京大学新型软件技术国家重点实验室）； Hong Kong University of Science and Technology（香港科技大学）； Frontier Robotics（前沿机器人）

AI总结针对LLM路由中单次响应作为监督信号噪声大的问题，提出DARS框架，从分布视角构建路由监督，考虑输入和输出不确定性，实验表明分布感知监督更稳定有效。


AI中文摘要

现有的LLM路由方法通常将模型对查询的单个响应作为训练路由器的能力标签。然而，由于LLM生成本质上是随机的，这种单次监督仅提供了查询-模型对行为的噪声观测，而非可靠的能力估计。我们表明，这种假设会向路由监督中引入系统性噪声，使得学习到的路由策略可靠性降低。为解决此问题，我们提出DARS（分布感知路由监督）框架，该框架从模型行为的分布视角构建路由监督。DARS不依赖单个生成的响应，而是考虑来自输入侧和输出侧的不确定性，捕捉语义等价的查询表述和随机生成如何影响模型性能。基于这些分布感知的观测，DARS为路由构建更可靠的监督信号。跨不同任务的实验表明，单次标签可能对模型选择产生误导，而分布感知监督提供更稳定的标签并改进学习到的路由行为。我们的结果表明，可靠的LLM路由应超越单次响应观测，并基于查询级模型能力分布。

英文摘要

Existing LLM routing methods typically treat a model's single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a noisy observation of a query-model pair's behavior rather than a reliable capability estimate. We show that this assumption introduces systematic noise into routing supervision, making learned routing policies less reliable. To address this issue, we propose DARS (Distribution-Aware Routing Supervision), a framework that constructs routing supervision from a distributional view of model behavior. Instead of relying on a single generated response, DARS considers uncertainty from both the input side and the output side, capturing how semantically equivalent query formulations and stochastic generations affect model performance. Based on these distribution-aware observations, DARS builds more reliable supervision signals for routing. Experiments across diverse tasks show that single-shot labels can be misleading for model selection, while distribution-aware supervision provides more stable labels and improves learned routing behavior. Our results suggest that reliable LLM routing should move beyond single-response observations and be grounded in query-level model capability distributions.
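
The claim that single-shot labels are a noisy basis for model selection can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the DARS procedure: the success probabilities in `true_p`, the sample count `k`, and the helper `pick_rate` are hypothetical, and only output-side randomness is simulated (the input-side paraphrase variation mentioned in the abstract is not modeled).

```python
import numpy as np


def pick_rate(true_p, k, trials, rng):
    """Fraction of trials in which the truly better model is selected when
    each model's capability is estimated from k sampled responses."""
    # outcomes[t, m, j] is True if model m answered sample j correctly in trial t
    outcomes = rng.random((trials, len(true_p), k)) < true_p[None, :, None]
    estimate = outcomes.mean(axis=2)  # empirical success rate per model
    # Tiny random jitter so that ties are broken at random, not by index.
    estimate = estimate + 1e-9 * rng.random(estimate.shape)
    return float(np.mean(estimate.argmax(axis=1) == true_p.argmax()))


rng = np.random.default_rng(0)
true_p = np.array([0.7, 0.5])  # hypothetical per-query success probabilities

single_shot = pick_rate(true_p, k=1, trials=20_000, rng=rng)
multi_sample = pick_rate(true_p, k=16, trials=20_000, rng=rng)
print(f"single-shot: {single_shot:.3f}, 16 samples: {multi_sample:.3f}")
```

With one sample per model, the better model is chosen only slightly more often than chance in this toy setting, while averaging several samples gives a markedly more stable label, which is the intuition behind supervising routers with capability distributions rather than single outcomes.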


2606.06923 2026-06-08 cs.AI cs.SE 新提交

Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows

知识驱动工具使用工作流中AI代理的声明式技能

M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, Laura Wynter

发表机构 * School of Computing and Information Systems（计算与信息系统学院）； Singapore Management University（新加坡管理大学）

AI总结提出声明式代理（通过自然语言技能文件控制流程）在知识密集型客服工作流中优于命令式状态机和无脚手架基线，但检索质量是主要瓶颈。


AI中文摘要

我们研究了在非结构化知识库上的现实客服工作流中，使用工具的AI代理的编排机制。我们认为声明式代理——即在系统提示中附加自然语言技能文件的AI代理——是一种有效的编排范式。具体地，我们比较了(i) 在推理时读取三个领域特定技能文件并自行决定控制流的DeclarativeAgent，(ii) 基于具有显式阶段的程序化状态机的ImperativeAgent，以及(iii) 基于$\tau$-Knowledge基准代理的无脚手架基线代理。我们的ImperativeAgent受递归语言模型和图编排框架中的外部化控制推理启发。我们将三种代理形式化为分散部分可观察马尔可夫决策过程中的策略类，并分析其信息论和结构特性；然后在五个语言模型和两种检索机制下实证测试预测的差异。结果表明，检索质量是AI代理的主要瓶颈：当证据不完整或偏斜时，所有代理性能大幅下降，技能文件无法恢复损失的性能。然而，在高品质检索下，声明式技能在程序性任务上持续提高准确性并减少编排错误，而命令式状态机的脆弱性并未可靠地提高任务成功或合规性。

英文摘要

We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We argue that declarative agents -- AI agents equipped with natural-language skill files appended to the system prompt -- are an effective orchestration paradigm. Concretely, we compare (i) a DeclarativeAgent that reads three domain-specific skill files at inference time and decides its own control flow, (ii) an ImperativeAgent based on a programmatic state machine with explicit phases, and (iii) an unscaffolded baseline agent modeled after the $\tau$-Knowledge benchmark agent. Our ImperativeAgent is motivated by externalised-control inference as in Recursive Language Models and graph-based orchestration frameworks. We formalise the three agents as policy classes within a decentralised partially-observable Markov decision process and analyse their information-theoretic and structural properties; we then test the predicted differences empirically on five language models and two retrieval regimes. Our results show that retrieval quality is a dominant bottleneck for AI agents: when evidence is incomplete or skewed, all agents degrade substantially, and skill files cannot recover lost performance. Under high-quality retrieval, however, declarative skills consistently improve accuracy on procedural tasks and reduce orchestration errors, while the imperative state machine's brittleness does not reliably improve task success or compliance.


2606.06920 2026-06-08 cs.LG cs.AI 新提交

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

微调陷阱：评估负迁移及PEFT在亚十亿参数数学推理中的作用

Rahul Nair, Chun Tao

发表机构 * GitHub ； University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结本研究评估了五种亚十亿参数模型在数学推理任务中的微调策略，发现全量微调对小于3亿参数的模型造成负迁移，而参数高效微调（PEFT）是稳定性要求。

Comments 8 pages, 6 figures, 2 tables


AI中文摘要

在边缘设备上部署小型语言模型（SLM）需要高效的微调策略，使模型适应新任务而不降低其通用能力。在本研究中，我们对五种亚十亿参数模型（135M-1B）在数学推理任务上进行了基准测试，并发现了一个关键脆弱性：全量微调（Full FT）会主动损害300M以下参数模型的性能，通常将准确率降至零样本基线以下。这种“负迁移”使得参数高效微调（PEFT）不仅是效率上的偏好，更是稳定性上的要求。我们发现，虽然低秩适应（LoRA）和权重分解LoRA（DoRA）性能相当，但它们的优势因任务而异：DoRA在复杂推理（GSM8K）中表现出色，而LoRA在模式匹配（OrcaMath）中占主导地位。特别地，在对齐模型（Qwen2.5-0.5B）上，LoRA优于全量微调，甚至在最小架构（SmolLM2-135M）上，简单的5-shot上下文学习也优于全量微调。基于这些发现，我们建议对所有对齐的亚十亿参数模型默认使用PEFT，并警告不要对任何小于500M参数的架构使用全量微调，以防止灾难性遗忘。本工作的复现见：https://github.com/gulguluu/tiny-slm-finetune-compare

英文摘要

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This "negative transfer" makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at https://github.com/gulguluu/tiny-slm-finetune-compare.


2606.06918 2026-06-08 cs.CV 新提交

DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

DRIFT: 从鲁棒性差距到AI生成图像检测的不变流形

Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee, Akshay Janardan Bankar, Amit Satish Unde

发表机构 * Samsung Research Institute, Bangalore, India（三星研究所，班加罗尔，印度）

AI总结提出DRIFT方法，通过冻结视觉基础模型并学习真实图像的结构化不变流形，利用鲁棒和脆弱子空间分解及排序间隔实现AI生成图像检测，在未见生成器和分辨率上表现优异。

Comments Submitted to ECCV 2026


AI中文摘要

生成图像模型的快速演进挑战了现有的AI生成图像检测器，尤其是在面对未见生成器的开放世界场景中。近期无训练方法通过测量冻结视觉基础模型（VFM）中的鲁棒性差距，利用扰动引起的嵌入漂移检测伪造图像。然而，这些方法依赖于预训练继承的固定不变几何结构，缺乏针对检测任务的原则性适应。我们转而将AI生成图像检测表述为在单类监督下学习真实图像的结构化不变流形。基于冻结的VFM，我们引入轻量级投影头，将表示空间分解为互补的鲁棒子空间和脆弱子空间。鲁棒子空间被显式训练以抑制由物理上合理的成像变换引起的变异，近似真实图像流形的切方向，而脆弱子空间则保持对类似编辑扰动的敏感性。结构化的排序间隔强制实现物理不变性与编辑诱导变异性之间的层次分离，使得检测成为相对于所学流形的间隔违反测试。在推理时，两种变换族下的多尺度逐块漂移产生双通道不变性特征和可解释的定位。大量实验表明，该方法在未见生成器和分辨率上具有强大的开放世界泛化能力，始终优于基于无训练鲁棒性的基线方法，同时提供可解释的不变性违反图。

英文摘要

The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.


2606.06908 2026-06-08 cs.CV 新提交

polyDAG: Polynomial Acyclicity Constraints for Efficient Continuous Causal Discovery in Visual Semantic Graphs

polyDAG：用于视觉语义图中高效连续因果发现的多项式无环性约束

Wenhao Zhang, Ramin Ramezani, Tao Han, Kai Hwang, Minyi Guo

发表机构 * Shanghai Jiao Tong University（上海交通大学）； University of California, Los Angeles（加州大学洛杉矶分校）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出多项式无环性约束框架polyDAG，用有限多项式迹约束替代矩阵指数约束，实现视觉语义图中高效的连续有向无环图学习，在合成图和CelebA数据集上提升了效率与结构恢复性能。


AI中文摘要

现代图像分析流程通常将图像转换为结构化语义变量，如面部属性、对象概念和场景描述符。学习这些变量之间的有向依赖关系可以生成可解释的视觉语义图，但连续有向无环图学习受到执行无环性成本的限制。我们提出了polyDAG，一个用于视觉语义图中高效连续因果发现的多项式无环性框架。polyDAG用有限多项式迹约束替代矩阵指数无环性约束，并证明了新约束恰好对有向无环图为零。我们进一步推导了一种几何级数实现，避免了显式求和循环，同时保持了相同的无环性条件。在合成Erdos-Renyi图和CelebA面部视觉属性上的实验表明，polyDAG提高了效率和结构恢复能力。在d∈{100,200,500}的修订合成协议上平均，polyDAG将平均结构汉明距离从318.4降低到285.4，并将平均F1分数从0.725提高到0.756。在100个节点时，几何变体运行时间为3.44秒，而指数基线为5.16秒，对应33.4%的加速。代码和数据公开于：https://github.com/wenhaoz-fengcai/polyDAG

英文摘要

Modern image-analysis pipelines often convert images into structured semantic variables, such as facial attributes, object concepts, and scene descriptors. Learning directed dependencies among these variables can produce interpretable visual semantic graphs, but continuous directed acyclic graph learning is limited by the cost of enforcing acyclicity. We present polyDAG, a polynomial acyclicity framework for efficient continuous causal discovery in visual semantic graphs. polyDAG replaces the matrix-exponential acyclicity constraint with a finite polynomial trace constraint and proves that the new constraint is zero exactly for acyclic graphs. We further derive a geometric-series implementation that avoids the explicit summation loop while preserving the same acyclicity condition. Experiments on synthetic Erdos-Renyi graphs and CelebA facial visual attributes show that polyDAG improves efficiency and structure recovery. Averaged over the revised synthetic protocol with d in {100, 200, 500}, polyDAG reduces mean structural Hamming distance from 318.4 to 285.4 and improves mean F1 score from 0.725 to 0.756. At 100 nodes, the geometric variant runs in 3.44 seconds compared with 5.16 seconds for the exponential baseline, corresponding to a 33.4 percent speedup. Code and data are publicly available at https://github.com/wenhaoz-fengcai/polyDAG.
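
The idea of a finite polynomial trace constraint with a geometric-series evaluation can be sketched as follows. The abstract does not state polyDAG's exact formula, so this uses a generic, well-known form as an assumption: with $A = W \circ W$ (elementwise square) for a $d$-node graph, $h(W) = \sum_{k=1}^{d} \operatorname{tr}(A^k)$ is zero exactly when the graph is acyclic, because $\operatorname{tr}(A^k)$ sums the weights of closed walks of length $k$. The function names `h_poly_loop` and `h_poly_geometric` are hypothetical, and the paper's coefficients and implementation may differ.

```python
import numpy as np


def h_poly_loop(W):
    """Sum of tr(A^k) for k = 1..d, with A = W * W, via an explicit loop."""
    d = W.shape[0]
    A = W * W                      # elementwise square keeps entries non-negative
    P = np.eye(d)
    total = 0.0
    for _ in range(d):
        P = P @ A                  # P now holds the next power of A
        total += np.trace(P)
    return float(total)


def h_poly_geometric(W):
    """Same quantity via the finite geometric series
    sum_{k=1}^{d} A^k = (I - A)^{-1} A (I - A^d).
    Assumes I - A is invertible (e.g. spectral radius of A below 1)."""
    d = W.shape[0]
    A = W * W
    I = np.eye(d)
    A_d = np.linalg.matrix_power(A, d)   # repeated squaring, no summation loop
    S = np.linalg.solve(I - A, A @ (I - A_d))
    return float(np.trace(S))


# Hypothetical 4-node example: a DAG, then the same graph with one back edge.
W_dag = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    W_dag[i, j] = 0.5
W_cyc = W_dag.copy()
W_cyc[3, 0] = 0.5                  # closes the cycles 0->1->2->3->0 and 0->2->3->0

print(h_poly_loop(W_dag), h_poly_geometric(W_dag))
print(h_poly_loop(W_cyc), h_poly_geometric(W_cyc))
```

Both functions return (numerically) zero for the DAG and the same positive value for the cyclic graph. The geometric form trades the length-$d$ loop for one matrix power and one linear solve, at the cost of requiring $I - A$ to be well conditioned.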


2606.06906 2026-06-08 cs.CL cs.AI 新提交

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

EASE-TTT: 面向长上下文问答的基于证据对齐的选择性测试时训练

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang, Yushun Dong

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Harvard University（哈佛大学）； Brion, ASML US LP ； Florida State University（佛罗里达州立大学）

AI总结提出EASE-TTT框架，通过将检索到的证据块转化为软注意力监督目标，指导查询侧参数适应，从而在保留完整上下文的情况下提升小模型的长上下文问答性能。

Comments 13 pages, 4 figures, 3 tables


AI中文摘要

长上下文问答（QA）对于较小的语言模型来说仍然具有挑战性，即使输入中已经存在包含答案的证据。现有的上下文内检索方法定位并暴露候选证据块给问题，但它们止步于输入级证据暴露，而不是调整控制模型如何在整个上下文位置上分配注意力的查询侧注意力参数。相比之下，轻量级的测试时适应方法，如仅查询的测试时训练（qTTT），由于它们通用的跨度级自监督目标无法识别哪些上下文位置支持当前答案，因此未能解决证据定位问题。在本文中，我们提出了基于证据对齐的选择性测试时训练（EASE-TTT），这是一个上下文内检索增强的测试时训练框架，它将选定的证据块转换为对其标记位置的软注意力监督目标。EASE-TTT不是用检索到的块替换完整上下文，而是使用生成的注意力目标来指导查询侧适应，适应后的模型从原始完整上下文中生成最终答案。在六个LongBench QA任务和三个小型仅解码器语言模型上的实验表明，EASE-TTT在全上下文推理、仅检索基线和qTTT中实现了最强的宏平均性能，支持了长上下文QA中基于证据对齐的测试时适应。

英文摘要

Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.


2606.06903 2026-06-08 cs.CV cs.AI 新提交

Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

超越骨架：使用Same2X训练策略直接从驱动视频学习动画

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

发表机构 * Tsinghua University（清华大学）； Harbin Institute of Technology（哈尔滨工业大学）； Pengcheng Laboratory（鹏城实验室）

AI总结提出DirectAnimator框架，通过驱动线索三元组和Same2X训练策略，绕过姿态提取直接从原始视频学习动画，实现鲁棒且高质量的人体图像动画生成。

Comments Accepted to ICLR 2026


AI中文摘要

人体图像动画旨在根据从驱动视频中提取的姿态信息，从静态参考图像生成视频。现有方法通常依赖姿态估计器提取中间表示，但在遮挡或复杂姿态下这些信号容易出错。基于这些观察，我们提出了DirectAnimator，一个绕过姿态提取并直接从原始驱动视频学习的框架。我们引入了一个由姿态、面部和位置线索组成的驱动线索三元组，以语义丰富且稳定的形式捕捉运动、表情和对齐，并通过CueFusion DiT块融合它们，以实现去噪过程中的可靠控制。为了使学习在驱动和参考身份不同时依然可靠，我们设计了Same2X训练策略，将跨身份特征与从相同身份数据中学到的特征对齐，从而正则化优化并加速收敛。大量实验表明，DirectAnimator在视觉质量和身份保持方面达到了最先进水平，对遮挡和复杂关节运动具有鲁棒性，且所需计算资源更少。项目页面：https://directanimator.github.io/

英文摘要

Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.


2606.06902 2026-06-08 cs.LG 新提交

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

TALAN：面向大型语言模型目标后训练的任务对齐潜在自适应网络

Chengkai Zhang, Ziteng Liu, Junpu Wang, Zeyi Tao, Yang Wang, Sagar Chordia, Qin Huang

发表机构 * Meta AI

AI总结提出TALAN，一种序列条件潜在旁路，插入Transformer残差流并与低秩适配器协同训练，在STEM/代码基准上平均提升LoRA 1.41点、DoRA 1.85点，仅增加<1%可训练参数和1.01-1.02倍推理开销。


AI中文摘要

目标后训练旨在提升推理、数学和代码能力而不损害原有优势。低秩适配器高效但任务全局；激活干预输入感知但通常需要独立的探针、向量或推理时引导。我们提出TALAN（任务对齐潜在自适应网络），一种序列条件潜在旁路，插入Transformer的残差流中，并在一个SFT循环中与低秩适配器协同训练。TALAN将活动序列压缩为潜在记忆，将其重新混合为令牌级扰动，并通过受控残差更新写回。它沿六个轴配置：插入位置、记忆大小、混合器、写回规则、可训练范围和梯度尺度。在四个Qwen3系列骨干和四个STEM/代码基准上，TALAN改进了匹配的LoRA和DoRA基线。使用LoRA，它实现了+1.41点的跨模型平均增益，在所有四个骨干上为正，在所有16个模型-基准单元上非负。使用DoRA，它实现了+1.85点的平均增益，在所有骨干上为正，在16个单元中的13个上为正。配对种子检查支持正平均效应但显示非平凡方差，因此我们将其视为敏感性检查。成本很小：相对于骨干的可训练参数<1%，推理开销为匹配LoRA的1.01-1.02倍。在Llama-3.2-1B上的迁移探针在LoRA和rsLoRA下，跨七个配对种子也呈正效应，支持超越Qwen的迁移。内部状态分析表明TALAN是一种小的互补激活干预。匹配的适配器更新比TALAN扰动大80-1700倍，但它们的余弦接近零；逐层测量显示这种小的正交扰动通过深度传播和放大。TALAN为在标准适配器后训练中研究可引导的激活级自适应提供了一个实用平台。

英文摘要

Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or inference-time steering. We introduce TALAN (Task-Aligned Latent Adaptation Networks), a sequence-conditioned latent side path inserted into a transformer's residual stream and co-trained with a low-rank adapter in one SFT loop. TALAN compresses the active sequence into latent memory, remixes it into token-level perturbations, and writes them back through a controlled residual update. It is configured along six axes: insertion location, memory size, mixer, writeback rule, trainability scope, and gradient scale. Across four Qwen3-family backbones and four STEM/code benchmarks, TALAN improves matched LoRA and DoRA baselines. With LoRA, it yields a +1.41 point cross-model mean gain, positive on all four backbones and non-negative on all 16 model-benchmark cells. With DoRA, it yields a +1.85 point mean gain, positive on all backbones and on 13 of 16 cells. Paired seed checks support positive average effects but show nontrivial variance, so we treat them as sensitivity checks. Cost is small: <1% trainable parameters relative to the backbone and 1.01-1.02x inference overhead versus matched LoRA. A Llama-3.2-1B transfer probe is also positive under LoRA and rsLoRA across seven paired seeds, supporting a transfer beyond Qwen. Internal-state analyses suggest TALAN is a small complementary activation intervention. The matched adapter update is 80-1,700x larger than the TALAN perturbation, yet their directions have near-zero cosine; per-layer measurements show this small orthogonal perturbation propagates and amplifies through depth. TALAN offers a practical platform for studying steerable activation-level adaptation within standard adapter-based post-training.


2606.06901 2026-06-08 cs.CV 新提交

LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography

LUCID：夜间摄影中图像去眩光与曝光控制的统一学习

Tingyu Yang, Yuan Cheng, Xiaoyun Yuan

发表机构 * MoE Key Lab of Artificial Intelligence（人工智能教育部重点实验室）； AI Institute（人工智能研究所）； School of Computer Science（计算机科学学院）； School of Biomedical Engineering（生物医学工程学院）； School of Artificial Intelligence（人工智能学院）

AI总结提出LUCID统一框架，通过眩光解缠模块和扩散驱动模块联合处理夜间图像中的眩光和噪声，并引入四模式训练实现可控恢复，支持HDR重建，性能优于现有方法。

Comments Accepted by SIGGRAPH 2026


AI中文摘要

摄影是用光绘画的艺术，但夜间场景受到相互竞争的退化影响：强烈的眩光掩盖了场景结构，而光子受限区域则陷入噪声。传统方法孤立地处理这些因素，忽略了这些退化本质上是纠缠的。为弥补这一差距，我们引入了LUCID，一个统一框架，将夜间恢复重新定义为连续且可控的过程，而非固定的校正。我们将夜间恢复分解为两个协作组件：一个眩光解缠模块，用于揭开光学伪影的“幕布”，提供可靠的结构指导；以及一个扩散驱动模块，利用生成先验重建干净且曝光良好的图像。关键的是，LUCID通过一种新颖的四模式训练策略引入了显式的可控性，使用户能够通过无分类器引导（CFG）引导恢复过程，并允许对光源及其相关的眩光和鬼影伪影进行选择性控制，同时通过连续曝光控制支持高动态范围（HDR）重建。大量实验表明，LUCID在多种真实夜间场景中始终优于最先进的方法。

英文摘要

Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the 'curtain' of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.


2606.06899 2026-06-08 cs.CV cs.LG 新提交

Lighting-Aware Representation Learning under Controllable Lighting Variation

可控光照变化下的光照感知表示学习

Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble

发表机构 * The Pennsylvania State University（宾夕法尼亚州立大学）

AI总结提出光照感知表示学习框架，将光照变化作为显式训练信号，通过辅助目标捕获光照依赖变化，在分类和检测任务上优于标准对比学习基线。


AI中文摘要

光照变化仍然是视觉表示学习的主要挑战，因为它们会在环境内部和之间引起显著的外观变化。虽然现有方法通常通过数据增强来鼓励模型对光照变化具有不变性，但这些策略在学习过程中并未显式建模光照信息。受人类视觉理论的启发，我们提出了一种光照感知表示学习框架，该框架将光照变化作为显式训练信号而非需要抑制的干扰因素。我们的方法通过引入一个辅助目标来扩展对比学习，该目标捕获渲染场景中光照依赖的变化，使模型能够联合学习保持语义一致性的表示，同时保持对光照依赖的视觉结构的敏感性。我们在ImageNet、ExDark和PASCAL VOC基准测试上评估了所提模型的图像分类和物体检测任务。结果表明，所提出的光照感知训练在保持相同架构和训练预算的情况下，始终优于标准对比学习基线。此外，我们的方法在监督学习框架和涉及更简单光照变化的设置中表现出有前景的性能，表明其具有超越复杂光照场景的广泛适用性。这些结果显示了它在复杂视觉环境以及更常规的图像处理任务中增强模型鲁棒性和适应性的潜力。

英文摘要

Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.


2606.06893 2026-06-08 cs.AI 新提交

Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

工作流到技能：通过路由-工作流-语义-附件分解创建技能

Yuyang Zhang, Xinyuan Han, Xudong Jiang, Run Wang

发表机构 * Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University（武汉大学国家网络安全学院，空天信息安全与可信计算教育部重点实验室）； Nanchang University（南昌大学）

AI总结提出RWSA中间表示和W2S框架，从异构交互证据中自动构建技能，通过分解工作流结构、执行语义和运行时附件，提升行为重放一致性10.5%。

Comments 10 pages, 2 figures


AI中文摘要

大型语言模型代理越来越依赖技能来编码程序性知识，但高质量技能的手工编写成本高昂。本文研究从异构交互证据（包括演示、代理轨迹、工具痕迹和执行日志）自动构建技能。我们认为，从痕迹到技能的构建并非简单的摘要任务，因为痕迹是碎片化、冗余的，并且可能遗漏罕见但安全关键的行为。为此，我们引入RWSA，一种面向工作流的中间表示，将技能分解为工作流结构、执行语义和运行时附件，捕获任务分解、控制流、验证、安全、回滚和状态管理。基于RWSA，我们提出W2S框架，该框架分割痕迹、诱导局部技能草稿、对齐共享结构、协调分支、压缩冗余，同时保留证据和置信度注释。在70个技能上的实验表明，W2S相比基于摘要和提示的基线，行为重放一致性提高了10.5%，凸显了将痕迹视为可执行运行时规范而非可压缩文本的必要性。

英文摘要

Large language model agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write. This paper studies automatic Skill construction from heterogeneous interaction evidence, including demonstrations, agent trajectories, tool traces, and execution logs. We argue that trace-to-skill construction is not a simple summarization task, because traces are fragmented, redundant, and may miss rare but safety-critical behaviors. To address this, we introduce RWSA, a workflow-oriented intermediate representation that decomposes Skills into Workflow structure, execution Semantics, and runtime Attachments, capturing task decomposition, control flow, verification, safety, rollback, and state management. Building on RWSA, we propose W2S, a framework that segments traces, induces local Skill drafts, aligns shared structures, reconciles branches, and compresses redundancy while preserving evidence and confidence annotations. Experiments on 70 Skills show that W2S improves behavioral replay consistency by 10.5% over summarization- and prompting-based baselines, highlighting the need to treat traces as executable runtime specifications rather than compressible text.
