arXivDaily: arXiv daily academic digest, updated Monday to Friday
EESS Electrical Engineering and Systems Science 119
2606.18196 2026-06-17 eess.SP New submission

Receiver-Aware Analysis and Verification of the Spectral Separation Coefficient Under Interference-Induced Degradation

Lucas Heublein, Fabian Benschuh, Alexander Rügamer, Felix Ott

AI summary This paper computes a receiver-dependent spectral separation coefficient (SSC) by incorporating receiver front-end characteristics, and experimentally validates the robustness of the interference impact computation on real-world and simulated datasets.

Comments 7 pages, 4 figures

Abstract

Interference poses a significant challenge to satellite-based positioning systems, making it essential to accurately quantify the effects of specific interference types on receiver performance and the resulting reliability of position computation. In current practice, interference effects are often quantified using receiver-independent metrics, with receiver-specific front-end characteristics either idealized or only implicitly considered. In this paper, we address this limitation by explicitly incorporating receiver-specific front-end characteristics into the computation of interference effects and validating the resulting receiver-dependent analysis experimentally. To this end, we record a real-world open-field dataset comprising 210 distinct interference scenarios and compute the receiver-dependent spectral separation coefficient (SSC) and interference impact for a specific receiver module. Furthermore, we verify the computation using a controlled dataset generated with a radio frequency constellation simulator (RFCS), employing the same receiver module and replaying similar interference classes. The comparison of results obtained in both environments demonstrates the robustness of the interference impact computation.

2606.18134 2026-06-17 eess.AS New submission

Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning

Alexander Polok, Samuele Cornell, Sathvik Udupa, Jan Černocký, Shinji Watanabe, Lukáš Burget

AI summary Proposes diarization-conditioned spoken language models that condition the acoustic encoder to extract target-speaker representations, avoiding the catastrophic forgetting risked by Serialized Output Training, and substantially improves speaker-attributed transcription on multiple datasets.

Comments Accepted to Interspeech 2026

Abstract

We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.

2606.18072 2026-06-17 eess.AS New submission

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

Zheqi Dai, Guangyan Zhang, Zhen Ye, Jingyu Li, Haolin He, Chunyat Wu, Yiwen Guo, Qiuqiang Kong

AI summary Proposes applying MeanFlow in a highly compressed latent space for one-step Token2Wav generation, resolving the speed-quality trade-off of multi-step flow-matching decoders, with an RTF improvement of up to 17× and negligible quality loss.

Comments 5 pages, 1 figure

Abstract

Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.
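
As a reminder of the mechanism named above, the general MeanFlow formulation replaces the instantaneous velocity of flow matching with an average velocity over an interval. The relations below are a sketch under that general formulation only. The conditioning on codec tokens $c$, the network symbol $u_\theta$, and the Gaussian prior are assumptions for illustration, since the abstract does not give the exact parameterization.

```latex
\begin{align}
  % Average velocity over the interval [r, t], with v the instantaneous velocity field
  u(z_t, r, t) &= \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau \\
  % Displacement along the flow expressed through the average velocity
  z_r &= z_t - (t - r)\, u(z_t, r, t) \\
  % One-step generation: r = 0, t = 1, starting from noise (c is a hypothetical token condition)
  \hat{z}_0 &= \epsilon - u_\theta(\epsilon, 0, 1 \mid c), \qquad \epsilon \sim \mathcal{N}(0, I)
\end{align}
```

A multi-step flow-matching decoder would instead integrate $v$ over many small steps, which is where the iterative sampling latency comes from.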

2606.18054 2026-06-17 eess.AS New submission

AI-based Cognitive-linguistic Features for Dementia Assessment in Picture Description

Lingfeng Xu, Prad Kadambi, Samuel Goldinger, Visar Berisha, Kimberly D. Mueller, Julie Liss

AI summary Proposes seven clinical constructs for the Cookie Theft picture description task and uses large language models to generate severity scores and explanations; Claude 3.5 Sonnet reaches 85% accuracy on the ADReSS dataset with an expert agreement rating of 3.99/5, showing the potential of LLMs for interpretable cognitive screening.

Comments 10 pages, 2 figures

Abstract

Picture descriptions provide valuable insights into several clinical constructs related to cognitive-linguistic abilities. However, operationalizing these constructs into quantitative measures remains challenging, limiting interpretability and clinical utility. We introduced seven constructs tailored to the Cookie Theft picture description task and prompted large language models (LLMs) to evaluate them, generating severity scores and example-based explanations. Among the examined LLMs, Claude 3.5 Sonnet performed the best, producing severity scores that significantly distinguish cognitively impaired individuals from healthy controls. The model achieves a high accuracy of 85% on the ADReSS dataset. Expert evaluation of Claude's scores and explanations yields a 3.99/5 average agreement. The findings demonstrate the potential of LLMs to operationalize clinical constructs and generate interpretable evaluations, offering a promising approach for accessible cognitive screening tools.

2606.17942 2026-06-17 eess.SP New submission

On the Optimum Energy-per-bit Launch Power in Coherent Hollow-core Fibre Transmission Systems

Ronit Sohanpal, Eric Sillekens, Mindaugas Jarmolovicius, Robert I. Killey, Polina Bayvel

AI summary Studies the launch power that minimizes energy per bit in hollow-core fibre transmission systems, finding that a 1000 km C-band link operating at the minimum energy-per-bit launch power can cut total power consumption by 41.5% with only a 2.2% throughput loss.

Comments European Conference on Optical Communications (ECOC) 2026

Abstract

We investigate the optimum energy per bit in hollow-core-fibre transmission systems. We show that a 1000 km C-band link can achieve a 41.5% reduction in total power consumption when operating at the minimum energy-per-bit launch power with only 2.2% throughput penalty.

2606.17903 2026-06-17 eess.SP New submission

Constellation Design for Nonlinear Unified SWIPT Receiver Channels with Memory

Triantafyllos Mavrovoltsos, Elio Faddoul, Zulqarnain Bin Ashraf, Constantinos Psomas, Besma Smida, Ioannis Krikidis

AI summary For nonlinear unified SWIPT receiver channels, proposes constellation design methods that account for memory effects, using a state-adaptive policy and an autoencoder framework to optimize the trade-off between symbol error rate and energy harvesting.

Comments Submitted to IEEE Transactions on Communications

Abstract

Unified receivers (URs) have emerged as a promising architecture for simultaneous wireless information and power transfer (SWIPT), since a common rectifying front-end enables information decoding (ID) and energy harvesting (EH) from the same rectified output. However, rectification is nonlinear due to the diode, while the capacitor introduces memory across symbols, making constellation design over the channel challenging. In this paper, we study constellation design for nonlinear UR-SWIPT channels in both memoryless and memory regimes. First, we propose a tractable unified rectification model that captures both (i) the nonlinear steady-state mapping and (ii) the asymmetric capacitor charging/discharging dynamics under transient operation. To isolate the impact of rectification with memory on ID, we study the information-based design. In this setting, we develop a state-adaptive policy with an algorithmic constellation design that accounts for the rectifier state and shapes the constellation in the observation domain. By approximating the rectifier state distribution, we derive a closed-form average symbol error rate (SER) expression and characterize the rate-reliability (R-R) tradeoff. We then seek constellations that minimize the SER under average transmit power and EH constraints. We address the resulting energy-constrained setting in the memoryless regime using an autoencoder-based framework that embeds the nonlinear rectification model as a differentiable channel block. Numerical results validate the proposed models, demonstrate the impact of memory on the R-R tradeoff, and show how learned constellations adapt to EH requirements in the rate-energy tradeoff.

2606.17900 2026-06-17 eess.SP New submission

Time-Slotted Multi-Cluster UAV AirComp with Energy-Awareness: A Pointer Network-Assisted Soft Actor-Critic Learning Framework

Xunqiang Lan, Xiao Tang, Ruonan Zhang, Qinghe Du, Tony Q.S. Quek

AI summary Proposes a UAV-assisted over-the-air computation system that minimizes aggregation error and energy consumption by jointly optimizing beamforming, normalizing factors, sensor scheduling, and UAV trajectory, solved with a hierarchical learning framework (pointer network and soft actor-critic).

Comments Accepted @ IEEE JSTSP

Abstract

Over-the-air computation (AirComp) has emerged as a promising approach for massive data aggregation, yet it is challenged by channel variations, task distributions, and the inherent energy limitations of the computation nodes. In this paper, we propose an unmanned aerial vehicle (UAV)-assisted AirComp system to serve multi-cluster computation tasks over time, where the spatial and temporal diversity afforded by UAV mobility is exploited for efficient and accurate data computation. Specifically, we aim to minimize the AirComp aggregation error and the energy consumption by jointly optimizing the transceiver beamforming, normalizing factors, sensor scheduling, and UAV trajectory. To solve the formulated problem, we decompose it into two layers, where the inner layer addresses the optimization-based AirComp transceiver design and the outer layer focuses on the deep reinforcement learning (DRL)-based scheduling and trajectory design. In particular, a pointer-network actor-critic learning scheme is developed to tackle the binary scheduling problem, and a soft actor-critic DRL algorithm is employed to determine the UAV trajectory. Simulation results validate the convergence of the proposed hierarchical learning framework and demonstrate significant performance gains in AirComp aggregation error and energy consumption compared with baseline schemes.

2606.17893 2026-06-17 eess.SP New submission

Condition-Wise Sinkhorn Drifting for One-Shot Learned Channel Simulation

Rick Fritschek, Rafael F. Schaefer

AI summary To address the high cost of diffusion-style reverse sampling in learned communication systems, proposes condition-wise Sinkhorn drifting, a one-shot channel surrogate whose generator is trained with a conditional Sinkhorn objective; evaluated on AWGN, Rayleigh fading, and other channels, the condition-wise variant is strongest under conditional diagnostics and symbolic-coding checks.

Comments 12 pages, 3 figures

Abstract

Learned communication systems may evaluate stochastic channel surrogates millions of times inside differentiable training loops, making diffusion-style reverse sampling expensive. This paper proposes condition-wise Sinkhorn drifting, a one-shot channel surrogate that preserves the transmitted symbol and transports only the conditional output laws \(p(y\mid x)\). We formulate a conditional Sinkhorn objective over repeated outputs at the same transmitted symbol and train the generator with finite-sample barycentric velocities followed by detached particle regression. Experiments on additive white Gaussian noise (AWGN), Rayleigh fading, solid-state power amplifier (SSPA) nonlinearity, and a compact tapped-delay-line (TDL) channel compare direct drifting, joint Sinkhorn drifting, condition-wise Sinkhorn drifting, conditional denoising diffusion probabilistic modeling (DDPM), denoising diffusion implicit modeling (DDIM), and Wasserstein generative adversarial network (WGAN) references. Within the evaluated one-shot drifting-family variants, condition-wise Sinkhorn is strongest under conditional diagnostics and symbolic-coding checks, while diffusion remains strongest on the hardest downstream symbol-error-rate (SER) curves. The resulting operating point is a condition-preserving one-shot simulator for settings where repeated channel calls make diffusion-style sampling too costly.
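
To make the ingredients above concrete, here is a minimal sketch of entropic optimal transport via Sinkhorn iterations, followed by a barycentric projection that turns the transport plan into per-sample velocities. It handles one condition group, that is, generated and reference outputs for the same transmitted symbol. The squared-distance cost, the scalar real-valued outputs, the regularization value, and all function names are assumptions for illustration; the abstract does not specify the paper's exact objective or training loop.

```python
import math

def sinkhorn_plan(xs, ys, eps=0.5, iters=200):
    """Entropic OT plan between two uniformly weighted scalar point sets."""
    n, m = len(xs), len(ys)
    # Gibbs kernel for a squared-distance cost (assumed cost function)
    kernel = [[math.exp(-((x - y) ** 2) / eps) for y in ys] for x in xs]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_velocities(xs, ys, eps=0.5, iters=200):
    """Velocity of each generated sample toward its plan-weighted target."""
    plan = sinkhorn_plan(xs, ys, eps, iters)
    velocities = []
    for x, row in zip(xs, plan):
        target = sum(p * y for p, y in zip(row, ys)) / sum(row)
        velocities.append(target - x)
    return velocities

# One condition group: outputs for the SAME transmitted symbol (toy values)
generated = [0.0, 1.0]
reference = [0.5, 1.5]
velocities = barycentric_velocities(generated, reference)
```

In a training loop, the targets `x + velocity` would be treated as constants (detached) and the generator regressed onto them, repeating the computation separately for each transmitted symbol.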

2606.17879 2026-06-17 eess.AS New submission

A 399uW 114.3 dB DR Companding Readout ASIC for MEMS Microphones Employing a Multirate Time-Domain ADC

Javier Granizo, Ruben Garvi, Ricardo Carrero, Jorge de la Torre, Javier Fernandez, Dietmar Straeussnigg, Andreas Wiesbauer, Luis Hernandez

AI summary Proposes a VCO-based multirate companding ADC architecture that mitigates boundary artifacts through a time-domain representation, achieving 114.3 dB dynamic range at under 400 μW for digital MEMS microphone readout.

Abstract

Improvements in the dynamic range and sensitivity of digital MEMS microphones are essential in applications such as advanced noise canceling and voice recognition. A cost-effective way to achieve these goals is the companding ADC architecture. Companding ADCs split the dynamic range into several segments with different quantization noise levels, relaxing power constraints. A common problem of companding microphones is the audible artifacts generated when the input signal crosses the boundaries between amplitude segments. In this paper, we show a companding ADC architecture that mitigates these boundary artifacts by leveraging the instantaneous, high-resolution time-domain representation of the input signal in a VCO-based ADC. A multirate frequency-to-digital converter decouples the quantization noise from the VCO frequency while keeping standard audio sampling rates. Co-optimization of the driver and oscillator circuits enables our VCO-ADC to reach a peak SFDR of >112 dBc without a feedback DAC, while keeping a gigaohm input impedance compatible with a capacitive MEMS. We show measurements of a 0.13 $\mu$m ASIC implementing a complete readout circuit for a digital MEMS microphone. This includes two analog channels and the digital signal processing and calibration blocks required to deliver a standard single-bit PDM output. The ADC reaches a dynamic range of 114.3 dB with a power budget under 400 $\mu$W, a Schreier $\mathrm{FoM}_{\mathrm{SNDR}}$ of 171.0 dB, and a $\mathrm{FoM}_{\mathrm{DR}}$ of 191.3 dB.
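
The reported dynamic-range figure of merit can be sanity-checked with the standard Schreier definition, FoM = metric (dB) + 10·log10(BW / P). The sketch below uses the 399 µW from the title and assumes a 20 kHz audio bandwidth; the bandwidth is not stated in the abstract, so it is an assumption here, as are the function and constant names.

```python
import math

def schreier_fom_db(metric_db, bandwidth_hz, power_w):
    """Schreier figure of merit: metric (DR or SNDR, in dB) + 10*log10(BW / P)."""
    return metric_db + 10.0 * math.log10(bandwidth_hz / power_w)

ASSUMED_BANDWIDTH_HZ = 20e3  # assumption: conventional audio bandwidth, not given in the abstract
POWER_W = 399e-6             # from the title: 399 uW
DR_DB = 114.3                # from the abstract: dynamic range

if __name__ == "__main__":
    fom_dr = schreier_fom_db(DR_DB, ASSUMED_BANDWIDTH_HZ, POWER_W)
    print(f"FoM_DR under the 20 kHz assumption: {fom_dr:.1f} dB")
```

Under that bandwidth assumption the computed value agrees with the 191.3 dB quoted in the abstract to within rounding.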

2606.17869 2026-06-17 eess.IV New submission

Perceptually-Weighted Video Quality Metric for Asymmetric Encoded Sports Videos

Anna Meyer, Jonas Janzen, Diwakara Reddy, Alexander Kopte, Simon Deniffel, Paul Wawerek-López, Marc Windsheimer, André Kaup

AI summary Proposes a perceptually-weighted video quality metric (PW-VQM) that separates foreground from background by combining open-vocabulary object detection with optical flow analysis and gives the foreground higher weight during quality aggregation; it reaches an SROCC of 0.9511 on sports videos, outperforming metrics such as SSIM and VMAF.

Comments accepted for International Conference on Quality of Multimedia Experience 2025 (QoMEX'26)

Abstract

Objective video quality metrics commonly assume uniform spatial attention, an assumption that conflicts with the selective nature of human visual perception, particularly in sports videos. Here, allocating more bits for salient regions through semantic encoding can lead to significant bitrate savings. We present a Perceptually-Weighted Video Quality Metric (PW-VQM), a full-reference metric that accounts for the unequal perceptual importance of spatial regions and therefore targets quality evaluation for asymmetrically encoded content. SSIM maps computed in a multiscale wavelet domain are weighted by differentiating between foreground and background regions. Perceptually salient foreground regions are identified by combining open-vocabulary object detection with optical flow analysis, and are assigned higher weight during quality aggregation. Evaluated on sports video content, PW-VQM achieves a Spearman Rank Order Correlation Coefficient of 0.9511, outperforming established metrics including SSIM, VMAF, FUNQUE, and LPIPS. An ablation study confirms the individual contributions of the components of the perceptual weighting.
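
The weighting idea described above can be illustrated with a small sketch: a per-pixel quality map is pooled with a higher weight on foreground pixels. This shows only the aggregation step. The function name, the weight values, and the toy maps are hypothetical, and the actual metric computes its SSIM maps in a multiscale wavelet domain with its own weighting, which the abstract does not detail.

```python
def weighted_quality(quality_map, foreground_mask, w_fg=3.0, w_bg=1.0):
    """Pool a per-pixel quality map, weighting foreground pixels more heavily.

    quality_map and foreground_mask are equally sized lists of rows;
    w_fg and w_bg are illustrative weights, not values from the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for quality_row, mask_row in zip(quality_map, foreground_mask):
        for quality, is_foreground in zip(quality_row, mask_row):
            weight = w_fg if is_foreground else w_bg
            weighted_sum += weight * quality
            weight_total += weight
    return weighted_sum / weight_total

# Toy frame: the single foreground pixel is the only degraded one
toy_quality = [[0.9, 0.5], [0.9, 0.9]]
toy_mask = [[0, 1], [0, 0]]
uniform_score = weighted_quality(toy_quality, toy_mask, w_fg=1.0, w_bg=1.0)
weighted_score = weighted_quality(toy_quality, toy_mask)
```

With uniform weights the degraded foreground pixel is averaged away; raising the foreground weight pulls the pooled score down, which is the behavior a metric for asymmetrically encoded content is after.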

2606.17806 2026-06-17 eess.AS New submission

PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement

Jun Gao, Xiaobin Rong, Yu Sun, Dahan Wang, Jing Lu

AI summary Proposes PhASE-Flow, a flow-matching speech enhancement framework that models directly in the self-supervised learning representation space, generating clean acoustic representations conditioned on phonetic ones and achieving competitive performance with only four sampling steps.

Comments Accepted by Interspeech 2026

Abstract

Flow matching (FM) enables high-fidelity generation, while self-supervised learning (SSL) speech models provide hierarchical representations spanning acoustic and phonetic levels. However, existing FM-based speech enhancement (SE) methods operate primarily in the spectral domain, treating SSL features only as external conditions rather than modeling directly in the SSL latent space. To fully exploit the structural richness of SSL representations, we propose PhASE-Flow, an FM-based SE framework that operates entirely in the SSL space. It models the conditional distribution of clean acoustic representations given phonetic ones, reconstructing the waveform via a neural vocoder. Experiments show that PhASE-Flow outperforms state-of-the-art baselines in perceptual quality and intelligibility. Notably, it achieves competitive performance with only four sampling steps, enabling highly efficient inference. Audio demos are available at this https URL.

2606.17801 2026-06-17 eess.SP New submission

Joint Direction-of-Arrival and Range Estimation for Millimeter-Wave Uniform Linear Array Radar

Necati Kagan Erkek, Zeynep Gul Pehlivanli

AI summary Proposes an FFT-based direction-of-arrival and range estimation framework for a 77 GHz monostatic uniform linear array radar, using narrowband and wideband waveforms for accurate angle and range estimation.

Comments 6 pages

Abstract

An FFT-based direction-of-arrival (DOA) and range-estimation framework for a monostatic uniform linear array (ULA) operating at 77 GHz is presented. A narrowband sinusoidal waveform is used to derive the spatial phase model, determine an aliasing-free inter-element spacing, and select the aperture required to obtain a boresight angular resolution of 2 degrees. The resulting design uses an element spacing of 0.97 mm and 58 antenna elements, corresponding to an aperture length of 56.42 mm. Numerical results show accurate angular estimation for a single target at 30 degrees and for multiple simultaneous targets. The analysis is further extended to two-dimensional localization by replacing the narrowband waveform with a 1 GHz sinc-modulated signal, which provides an approximate range resolution of 0.15 m. Additional simulations quantify the effects of additive complex Gaussian noise, increased antenna spacing, and target decorrelation on the DOA response.
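
The design numbers quoted above can be cross-checked with textbook relations. The sketch below assumes a two-way (monostatic) spatial phase, which gives an aliasing-free spacing of λ/4 and a boresight resolution of roughly λ/(2L) for an aperture L, and uses the usual c/(2B) range resolution. The abstract does not state which formulas the authors used, so these relations and the function names are assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def wavelength_m(freq_hz):
    return SPEED_OF_LIGHT / freq_hz

def alias_free_spacing_m(freq_hz):
    # Assumption: the round trip doubles the spatial phase, so the limit is lambda / 4
    return wavelength_m(freq_hz) / 4.0

def boresight_resolution_deg(freq_hz, aperture_m):
    # Assumption: two-way resolution of about lambda / (2 * L) radians at boresight
    return math.degrees(wavelength_m(freq_hz) / (2.0 * aperture_m))

def range_resolution_m(bandwidth_hz):
    # Standard radar range resolution c / (2 * B)
    return SPEED_OF_LIGHT / (2.0 * bandwidth_hz)

if __name__ == "__main__":
    carrier_hz = 77e9
    print(f"spacing    : {alias_free_spacing_m(carrier_hz) * 1e3:.2f} mm")
    print(f"resolution : {boresight_resolution_deg(carrier_hz, 56.42e-3):.2f} deg")
    print(f"range res. : {range_resolution_m(1e9):.2f} m")
```

Under these assumptions the spacing, angular resolution, and range resolution come out close to the 0.97 mm, 2 degrees, and 0.15 m given in the abstract.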

2606.17737 2026-06-17 eess.SP New submission

Deep CSI Feedback for FDD Massive MIMO Systems: A Curvelet Learning Approach

Mengli Tao, Jiancun Fan, Huiqiang Xie, Kai Xie

AI summary To address the high CSI feedback overhead in FDD massive MIMO systems, proposes SwinCANet, a curvelet-transform-based framework that improves reconstruction quality through frequency-domain decomposition and attention mechanisms, and introduces a denoising variant to suppress noise; simulations confirm its superior performance.

Abstract

Downlink channel state information (CSI) feedback plays a key role in frequency division duplex (FDD) massive multiple-input multiple-output (mMIMO) systems. The growing number of antennas in ultra-massive MIMO increases the difficulty and overhead of CSI feedback, which poses significant challenges for conventional downlink CSI feedback mechanisms. To address the limitations of existing CSI feedback approaches, this paper proposes a novel curvelet-learning-based framework termed SwinCANet, comprising a frequency-domain information processing module and a denoising module. The frequency-domain information processing module employs the curvelet transform to decompose CSI into low-frequency and high-frequency components. A Swin Transformer and a channel-wise attention block then extract the low-frequency and high-frequency representations, respectively, thereby enhancing reconstruction quality. Notably, an additional Swin Transformer fuses the multi-scale frequency components, enhancing capabilities across different angular resolutions and spatial directions. Furthermore, we develop a variant (De-SwinCANet) that employs a Sigmoid threshold function to suppress noise coefficients, thereby mitigating various channel impairments and nonlinear distortions. Numerical simulation results demonstrate that the proposed method outperforms existing benchmarks while remaining robust under challenging propagation conditions.

2606.17718 2026-06-17 eess.SP New submission

BASIIS: Bistatic Angular Sampling and Interpolation for ISAC Setups

Alexander Felix, Marcus Henninger, Lucas Giroto, Maximilian Bauhofer, Stephan ten Brink, Silvio Mandelli

AI summary For the four-dimensional angular-domain sampling problem of TX and RX arrays in bistatic ISAC, proposes a minimal sampling and interpolation scheme based on the ortho-baseline coarray, acquiring 3 to 5 times fewer TX-RX direction pairs while preserving detection accuracy.

Abstract

Integrated Sensing and Communications (ISAC) is a defining feature of 6G, extending cellular networks with radar-like sensing at limited additional overhead. In bistatic deployments, sensing requires coordinating the transmitter (TX) and receiver (RX) arrays to scan the Cartesian product of the angles of departure and arrival, resulting in a four-dimensional sampling problem in the angular domain. This work establishes a complete angular sampling framework for bistatic ISAC, extending the DFT-based optimal-sampling methodology to the full azimuth and elevation domains of both arrays. We show that the bistatic geometry couples the TX and RX elevation angles, and represent this coupling through the ortho-baseline coarray, a virtual array that captures the joint elevation aperture of the array pair. From the coarray we derive a minimal sampling and interpolation scheme that is near-lossless and realizable with any beamforming architecture. Monte Carlo simulations confirm that the proposed minimal acquisition essentially matches the detection accuracy of dense oversampled imaging while acquiring 3 to 5 times fewer TX-RX direction pairs. This enables bistatic operation with drastically reduced radio resource overhead in ISAC systems.

2606.17699 2026-06-17 eess.SP New submission

Joint Synchronization and Radar Parameter Estimation for Distributed OFDM-ISAC Systems

Niclas Führling, Hyeon Seok Rou, Kuranage Roche Rayan Ranasinghe, Giuseppe Thadeu Freitas de Abreu, Nuria González-Prelcic

AI summary To address synchronization of distributed ISAC systems in doubly-dispersive channels, proposes a joint synchronization and radar parameter estimation method based on bivariate Gaussian belief propagation that jointly estimates time offsets, carrier frequency offsets, and channel parameters, with performance approaching the CRLB.

Abstract

We propose a novel approach to synchronization in distributed ISAC (DISAC) systems in doubly-dispersive (DD) channel environments via a joint synchronization and radar parameter estimation framework. The proposed method exploits the structure of the system model, which can be linearized in order to apply a bivariate Gaussian belief propagation (GaBP) algorithm that jointly estimates the time offset (TO) and carrier frequency offset (CFO) of each base station (BS), as well as the delay and Doppler parameters of the DD channel in conventional orthogonal frequency division multiplexing (OFDM) systems. Simulation results demonstrate the effectiveness of the proposed algorithm, showing that the radar parameter estimates (i.e., range and velocity) and synchronization parameter estimates (i.e., TO and CFO) approach the Cramér-Rao lower bound (CRLB) even in low-to-moderate signal-to-noise ratio (SNR) regimes.

2606.17662 2026-06-17 eess.AS New submission

An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages

Sujith Pulikodan, Agneedh Basu, Pavan Kumar, Pranav Bhat, Visruth Sanka, Nihar Desai, Prasanta Kumar Ghosh

AI summary Studies the effect of combining synthetic speech data with real data for ASR fine-tuning in three Indic languages (Hindi, Kannada, and Telugu), analyzing how different synthesis sources and voice cloning affect performance.

Abstract

Synthetic data has the potential to be a valuable resource for training machine learning models, particularly Automatic Speech Recognition (ASR) Systems; however, its effectiveness requires systematic evaluation. In this study, we investigate the impact of incorporating synthetic speech data alongside real-world recordings for three Indic languages: Hindi, Kannada, and Telugu. We analyze the performance gains achieved by augmenting synthetic data with real data and independently examine how ASR performance varies with the sources of scripts used to generate synthetic speech. In addition, we evaluate the effect of synthetic speech generated using different speech synthesis models. Finally, we study the impact of voice cloning in synthetic speech generation on ASR performance, including how performance varies with the number of distinct cloned voices used during data generation.

2606.17641 2026-06-17 eess.SP New submission

Toward Quantum-Enhanced ISAC: Active-RIS-Aided Integrated Sensing and Communication with Rydberg Atomic Receivers

Hong-Bae Jeon, Hyung-Joo Moon, Yonghwi Kim

AI summary Proposes an active-RIS-aided integrated sensing and communication system that exploits the magnitude-only, real-domain observations of Rydberg atomic receivers, jointly designing base station beamforming and RIS reflection coefficients to minimize the Cramér-Rao bound, with an alternating optimization framework for the non-convex problem.

Abstract

In this paper, we investigate an active-RIS (ARIS)-aided integrated sensing and communication (ISAC) system with a Rydberg Atomic REceiver (RARE). Leveraging the magnitude-only and real-domain observation structure of RARE, we first derive a unified ISAC model, along with a closed-form Cramér-Rao bound (CRB) for direction-of-arrival (DoA) estimation. Based on this formulation, we propose a joint design of the base station (BS) beamforming and ARIS reflection coefficients to minimize the CRB under RARE-specific signal-to-interference-plus-noise ratio (SINR) and ARIS power constraints. To tackle the resulting highly non-convex problem, we develop an alternating optimization (AO) framework that combines semidefinite relaxation (SDR) for beamforming and a majorization-minimization (MM)-based approach for ARIS design. Numerical results demonstrate that the proposed RARE-aware framework significantly outperforms conventional RF-based designs and achieves performance close to the radar-only benchmark, highlighting the potential of RARE for quantum-enhanced ISAC with ARIS.

2606.17570 2026-06-17 eess.IV New submission

Fine-UNETR for PSMA PET/CT Lesion Segmentation: Automated Tumor Quantification and Overall Survival Stratification in Prostate Cancer

Mansour Abtahi, Chae Moon Hong, Nikhil Deveshwar, Stellamaris Nwihim, Peder E.Z. Larson, Thomas A. Hope

AI summary Proposes Fine-UNETR, a Vision Transformer-based architecture for automated whole-body PSMA PET/CT lesion segmentation, and validates the clinical utility of AI-derived tumor burden biomarkers for overall survival stratification before radioligand therapy.

Abstract

Introduction: To develop and evaluate Fine-UNETR, a Vision Transformer-based architecture for automated segmentation of PSMA-avid lesions on whole-body PET/CT, and to assess clinical utility of AI-derived tumor burden biomarkers for overall survival stratification in radioligand therapy. Methods: In this retrospective study, 373 PSMA PET/CT scans (mean age, 71 ± 8 years) from patients with prostate cancer were analyzed. Fine-UNETR, a modified UNETR with 8×8×8 voxel patch embedding and axial sliding window training, was trained on 299 scans and validated on 74 scans. Overall survival stratification was assessed in an independent cohort of 67 pre-radioligand therapy patients using Kaplan-Meier analysis and log-rank testing. External validation was performed on 192 cases from the AutoPET IV PSMA PET/CT dataset. Results: Fine-UNETR achieved a Dice similarity coefficient (DSC) of 66.63%, sensitivity of 70.27%, precision of 67.77%, and a lesion detection rate of 79.53% (96.05% for lesions with SUVmax ≥ 5). On the external validation dataset, the model achieved a DSC of 44.11% and a lesion detection rate of 87.18%, indicating that lesion detection performance was preserved despite reduced voxel-level overlap. AI-derived biomarkers showed excellent agreement with ground truth (total tumor volume: r=0.984; total lesion uptake: r=0.989; lesion count: r=0.960). In the clinical cohort, total tumor volume (p=0.0019), SUVmax (p=0.014), and SUVmean (p=0.016) significantly stratified overall survival. Conclusion: Fine-UNETR enables accurate automated whole-body PSMA lesion segmentation and tumor burden quantification. Performance on an external dataset demonstrates robustness despite evidence of domain shift. AI-derived biomarkers significantly stratified overall survival in a pre-radioligand therapy cohort, supporting the clinical utility of automated PSMA PET/CT quantification for prognostication.
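
The voxel-level metrics quoted above (DSC, sensitivity, precision) all derive from the same true-positive, false-positive, and false-negative counts. A minimal sketch follows, assuming flattened binary masks; the function name `overlap_metrics` and the toy masks are illustrative and not taken from the paper.

```python
def overlap_metrics(pred, truth):
    """Return (dice, sensitivity, precision) for two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count voxel-level agreement and disagreement.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    # Convention (an assumption): two empty masks count as perfect agreement.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return dice, sensitivity, precision


# Toy example: 3 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
print(overlap_metrics(pred, truth))
```

Because DSC penalizes every mislabeled voxel, it can drop sharply under domain shift even when most lesions are still found, which is consistent with the external-validation numbers reported in the abstract.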

2606.17479 2026-06-17 eess.SP 新提交

A Miniaturized Dynamic Array for Antenna-Level Physical Layer Security

用于天线级物理层安全的小型化动态阵列

Sheng Huang, Jacob R. Randall, Cory Hilton, Jeffrey A. Nanzer

AI总结 提出一种基于方向调制的紧凑动态全向阵列,通过单射频输入和开关控制四元印刷曲折线单极子阵列,在E面实现角度选择性信息恢复,H面保持全向覆盖。

Comments 14 pages, 11 figures

详情
AI中文摘要

提出了一种紧凑的动态全向阵列,用于通过方向调制实现天线级物理层安全。与基于相控阵波束合成或多个射频链的传统方向调制发射机不同,所提出的架构使用单个射频输入和开关控制的四元印刷曲折线单极子阵列,工作于5.05 GHz。状态相关的激励在辐射场中引入可控的幅度和相位扰动,产生角度相关的星座畸变和误码率行为。可靠的信息恢复被限制在E面的窄边射区域,而H面保持准静态全向,提供完整的360度信息可恢复区域。该天线在单层Rogers RO4350B基板上实现,紧凑尺寸为0.57 x 1.11 λ₀²。使用基于商用射频元件的四路开关网络进行实验验证。采用16-QAM在5.05 GHz的通信测量表明,在BER ≤ 10⁻³准则下,校准开关模式的E面信息波束宽度为30至36度,而测量的H面未观察到误码,信噪比保持在约33 dB以上。还利用馈电相位偏移来引导BER定义的信息可恢复扇区,展示了使用相同天线级开关机制的信息波束转向。这些结果表明,紧凑的天线级方向调制可以在一个主平面内提供角度选择性信息恢复,同时在正交平面保持全向覆盖。

英文摘要

A compact dynamic omnidirectional array is proposed for antenna-level physical-layer security through directional modulation. Unlike conventional directional-modulation transmitters based on phased-array beam synthesis or multiple RF chains, the proposed architecture uses a single RF input and a switching-controlled four-element printed meander-line monopole array operating at 5.05 GHz. The state-dependent excitation introduces controllable magnitude and phase perturbations in the radiated field, producing angle-dependent constellation distortion and bit error rate behavior. Reliable information recovery is confined to a narrow broadside region in the E-plane, whereas the H-plane remains quasi-static and omnidirectional, providing a full 360-degree information-recoverable region. The antenna is implemented on a single-layer Rogers RO4350B substrate with a compact footprint of 0.57 × 1.11 λ₀². A four-path switching network based on commercial RF components is used for experimental validation. Communication measurements using 16-QAM at 5.05 GHz demonstrate BER-defined E-plane information beamwidths of 30 to 36 degrees for calibrated switching modes under a BER ≤ 10⁻³ criterion, while no bit errors are observed in the measured H-plane and the SNR remains above approximately 33 dB. Feed-phase offsets are also used to steer the BER-defined information-recoverable sector, demonstrating information-beam steering with the same antenna-level switching mechanism. These results show that compact antenna-level directional modulation can provide angularly selective information recovery in one principal plane while preserving omnidirectional coverage in the orthogonal plane.

2606.17439 2026-06-17 eess.SP 新提交

Two-Stage IQ Imbalance Estimation and Compensation for AFDM Systems

AFDM系统的两级IQ不平衡估计与补偿

Zhenfeng Huang, Yitong Liu, Yuping Yan, Hongwen Yang

AI总结 针对AFDM系统中的IQ不平衡问题,提出两级估计与补偿方法:先利用前导码迭代估计时不变参数,再结合BEM信道估计与改进LMMSE检测器抑制干扰,实现快速收敛和近理想误码率。

Comments submitted to IEEE Wireless Communications Letters

详情
AI中文摘要

仿射频分复用(AFDM)是一种新兴的基于啁啾信号的多载波波形,在双选择性信道中具有强分集性,但实际系统存在发射机和接收机IQ不平衡,导致镜像干扰和性能下降。本文提出了一种用于AFDM系统的两级IQ不平衡估计与补偿方法。首先,利用前导码辅助的迭代算法,通过利用IQ不平衡参数的慢时变特性来估计时不变参数。然后,一种联合信道估计与数据检测方案将基于基扩展模型(BEM)的信道估计与改进的LMMSE检测器相结合,用于干扰抑制。仿真结果表明,该方法收敛速度快,误码率性能接近理想情况。

英文摘要

Affine frequency division multiplexing (AFDM) is an emerging chirp-based multicarrier waveform with strong diversity in doubly selective channels, but practical systems suffer from transmitter and receiver IQ imbalance, causing image interference and performance degradation. This paper proposes a two-stage IQ imbalance estimation and compensation method for AFDM systems. First, a preamble-assisted iterative algorithm estimates the time-invariant IQ imbalance parameters by exploiting their slowly time-varying nature. Then, a joint channel estimation and data detection scheme combines basis expansion model (BEM)-based channel estimation with an improved LMMSE detector for interference suppression. Simulations show rapid convergence and near-ideal BER performance.
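
The image interference mentioned above can be illustrated with a common frequency-independent IQ imbalance model, y = αx + βx*, where the conjugate term is the image. The sketch below is not the paper's two-stage estimator: it assumes the gain and phase mismatch are already known and only shows the signal model and its inversion. The parameterization of α and β is one textbook convention and may differ from the one used by the authors.

```python
import cmath


def iq_coefficients(gain, phase_rad):
    """Map gain/phase mismatch to (alpha, beta); gain=1, phase=0 gives (1, 0)."""
    alpha = (1 + gain * cmath.exp(-1j * phase_rad)) / 2
    beta = (1 - gain * cmath.exp(1j * phase_rad)) / 2
    return alpha, beta


def apply_iq_imbalance(x, alpha, beta):
    """Distort samples: the beta * conj(x) term is the image interference."""
    return [alpha * s + beta * s.conjugate() for s in x]


def compensate(y, alpha, beta):
    """Invert the model when alpha and beta are known (or already estimated)."""
    denom = abs(alpha) ** 2 - abs(beta) ** 2
    return [(alpha.conjugate() * r - beta * r.conjugate()) / denom for r in y]


alpha, beta = iq_coefficients(gain=1.05, phase_rad=0.05)
x = [1 + 1j, -1 + 3j, 3 - 1j]          # a few 16-QAM-like symbols
y = apply_iq_imbalance(x, alpha, beta)
x_hat = compensate(y, alpha, beta)
print(max(abs(a - b) for a, b in zip(x, x_hat)))  # residual after compensation
```

The compensation step follows from combining y with its conjugate so that the x* terms cancel, leaving (|α|² − |β|²)x.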

2606.17382 2026-06-17 eess.SP 新提交

Automated Estimation of Equivalent Circuit Model from Impedances with Long Short-Term Memory

基于长短期记忆网络的阻抗等效电路模型自动估计

Ryoma Iki, Motoya Furugori, Noboru Katayama

AI总结 提出一种结合LSTM和卷积特征提取器的机器学习方法,直接从阻抗谱生成等效电路拓扑,无需拟合或预设元件数量,在合成数据上以77.8%准确率识别正确拓扑。

详情
AI中文摘要

电化学阻抗谱(EIS)是一种广泛使用的非破坏性电化学系统表征技术,其分析通常依赖于将测量谱拟合到等效电路模型(ECM)。然而,选择合适的ECM仍然是一个主要瓶颈:基于知识的选择需要专家判断且难以复现,而现有的自动化方法要么从固定的候选电路集中选择,要么在基因表达编程的情况下需要重复的等效电路拟合和预定的电路规模。本文提出了一种机器学习方法,通过将电路表示为符号序列,并利用长短期记忆(LSTM)网络结合卷积特征提取器生成该序列,直接从阻抗谱估计ECM。由于LSTM天然处理变长序列,该方法直接生成电路拓扑,在估计过程中无需任何拟合,也不需要对元件数量进行先验假设。引入阻抗的四次方根变换以强调对区分电路至关重要的中频特征,自适应波束搜索生成多个排序候选。在由119种电路拓扑生成的100,000个合成数据集(阻抗添加1%噪声)上评估,该方法在77.8%的情况下识别出正确拓扑作为最可能的ECM,在98.8%的情况下正确拓扑位于前五名候选之中,每个数据集的平均估计时间为17.8毫秒——比报道的基于拟合的方法快几个数量级。这些结果表明,使用神经网络直接生成拓扑是实现全自动、无需专家的ECM估计的有前景的途径。

英文摘要

Electrochemical Impedance Spectroscopy (EIS) is a widely used, non-destructive technique for characterizing electrochemical systems, and its analysis typically relies on fitting the measured spectra to an Equivalent Circuit Model (ECM). Selecting an appropriate ECM, however, remains a major bottleneck: knowledge-based selection requires expert judgment and is difficult to reproduce, while existing automated approaches either choose from a fixed set of candidate circuits or, in the case of Gene Expression Programming, require repeated equivalent-circuit fitting and a predetermined circuit scale. Here, we propose a machine learning method that estimates an ECM directly from an impedance spectrum by representing the circuit as a serialized string of symbols and generating this string with a Long Short-Term Memory (LSTM) network coupled to a convolutional feature extractor. Because the LSTM inherently handles variable-length sequences, the method produces the circuit topology directly, without any fitting during estimation or any prior assumption about the number of elements. A fourth-root transformation of the impedance is introduced to emphasize the mid-frequency features essential for distinguishing circuits, and an adaptive beam search yields multiple ranked candidates. Evaluated on 100,000 synthetic datasets generated from 119 circuit topologies with 1% added noise on impedances, the method identified the correct topology as the most probable ECM in 77.8% of cases and among the top five candidates in 98.8% of cases, with an average estimation time of 17.8 milliseconds per dataset, several orders of magnitude faster than reported fitting-based approaches. These results indicate that direct topology generation with a neural network is a promising route toward fully automated, expert-independent ECM estimation.
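
To make the fourth-root idea concrete, the sketch below computes the impedance of a simple circuit (a resistor in series with a parallel RC pair) and compresses its dynamic range. The abstract does not say how the root is applied to a complex impedance, so the sign-preserving fourth root of the real and imaginary parts used here is purely an assumption, as are the function names and component values.

```python
import math


def rc_impedance(freq_hz, r0, r1, c1):
    """Impedance of R0 in series with (R1 parallel C1) at one frequency."""
    omega = 2 * math.pi * freq_hz
    return r0 + r1 / (1 + 1j * omega * r1 * c1)


def signed_fourth_root(value):
    """Fourth root of the magnitude, keeping the sign (assumed transform)."""
    return math.copysign(abs(value) ** 0.25, value)


def transform(z):
    """Apply the assumed transform to real and imaginary parts separately."""
    return signed_fourth_root(z.real), signed_fourth_root(z.imag)


# Sweep four decades of frequency for R0 = 10 ohm, R1 = 100 ohm, C1 = 1 uF.
for freq in (1e1, 1e2, 1e3, 1e4, 1e5):
    z = rc_impedance(freq, r0=10.0, r1=100.0, c1=1e-6)
    print(freq, z, transform(z))
```

A root transform of this kind shrinks large values more than small ones, so features of moderate size are less likely to be swamped by the largest impedances in the spectrum.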

2606.17337 2026-06-17 eess.AS 新提交

From Signals to Patterns: Non-Invasive Tuberculosis Detection from Cough Audio using Bandit Weighted Hyperbolic Prototypes

从信号到模式:使用Bandit加权双曲原型从咳嗽音频进行非侵入性结核病检测

Mohd Mujtaba Akhtar, Girish, Sanjam Wadhwa, Muskaan Singh, Ning Ma

AI总结 提出COBALT框架,融合频谱特征与语音基础表示,通过码本对齐双曲原型和Bandit可靠性加权,在CODA TB DREAM挑战基准上实现最优性能。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

在本研究中,我们聚焦于基于咳嗽的结核病筛查(CBTS),并假设将语音/音频基础表示与频谱描述符融合将产生更强的筛查性能。我们预期这种融合将揭示互补优势:频谱特征保留了咳嗽信号中细粒度的短时声学细节,而基础嵌入则捕获了从大规模预训练中学到的高层时间和事件级模式。为此,我们提出了COBALT,一种基于码本对齐双曲原型和bandit式可靠性加权的新型融合框架,以有效整合异构表示。使用CODA TB DREAM挑战基准,COBALT始终优于单个表示和拼接基线,在融合MFCC与PaSST时实现了最佳整体性能,从而在该基准上建立了新的最先进水平。

英文摘要

In this study, we focus on cough-based tuberculosis screening (CBTS) and hypothesize that fusing speech/audio foundation representations with spectral descriptors will yield stronger screening performance. We expect this fusion to reveal complementary strengths: spectral features preserve fine-grained short-time acoustic detail in cough signals, while foundation embeddings capture higher-level temporal and event-level patterns learned from large-scale pretraining. To this end, we propose COBALT, a novel fusion framework based on codebook-aligned hyperbolic prototypes and bandit-style reliability weighting to integrate heterogeneous representations effectively. Using the CODA TB DREAM Challenge benchmark, COBALT consistently outperforms individual representations and a concatenation baseline, achieving the best overall performance when fusing MFCC with PaSST, thereby establishing a new state of the art on the benchmark.

2606.17333 2026-06-17 eess.SP 新提交

Communication Modeling of Long-Distance Abscisic Acid Signaling in Plant Vascular Systems

植物维管系统中长距离脱落酸信号传导的通信建模

Necati Kagan Erkek, Hani Ballouz, Radin Monshian Motlagh

AI总结 综述脱落酸(ABA)的生物合成、长距离运输及实验量化方法,提出基于分子通信的ABA传输模型,通过MATLAB布朗运动模拟评估释放量和接收器半径对检测信号的影响。

Comments 16 pages

详情
AI中文摘要

脱落酸(ABA)是一种关键的植物激素,用于协调对干旱、盐碱、冷胁迫、病原体攻击、创伤和发育老化的响应。本文综述了增加ABA生物合成的生物刺激、主要产生部位和途径,以及ABA通过植物维管组织的长距离运动。然后讨论了实验量化方法,包括带电子捕获检测的气液色谱法和带紫外检测的高效液相色谱法。最后,本文提出了一种受分子通信启发的ABA传输模型,其中根侧ABA释放被表示为发射器,木质部路径被表示为有界通道,大豆组织被表示为接收器。使用MATLAB布朗运动模拟来评估释放分子数量和接收器半径对检测到的ABA信号的影响。结果表明,更高的释放量产生更平滑和更强的接收趋势,而更大的接收器增加分子捕获概率。

英文摘要

Abscisic acid (ABA) is a central plant hormone for coordinating responses to drought, salinity, cold stress, pathogen attack, wounding, and developmental aging. This paper reviews the biological stimuli that increase ABA biosynthesis, the main production sites and pathways, and the long-distance movement of ABA through plant vascular tissues. It then discusses experimental quantification approaches, including gas-liquid chromatography with electron-capture detection and high-performance liquid chromatography with ultraviolet detection. Finally, the paper presents a molecular-communication-inspired model of ABA transport in which root-side ABA release is represented as a transmitter, the xylem pathway as a bounded channel, and soybean tissue as a receiver. MATLAB Brownian-motion simulations are used to evaluate the effects of released molecule quantity and receiver radius on the detected ABA signal. The results show that higher release quantities produce smoother and stronger reception trends, while larger receivers increase molecule-capture probability.
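
The transmitter, bounded channel, and receiver picture described above can be sketched as a particle simulation. The paper uses MATLAB; the Python version below is only an illustrative analogue under assumed geometry: a 2D channel with reflecting walls, a point release at the origin, and a circular receiver. All parameter values and the name `captured_fraction` are hypothetical.

```python
import math
import random


def captured_fraction(n_molecules, receiver_radius, seed=0, steps=400,
                      diffusion=1.0, dt=0.01, distance=1.0, half_width=0.5):
    """Fraction of released molecules that ever enter the receiver disc."""
    rng = random.Random(seed)
    sigma = math.sqrt(2 * diffusion * dt)  # per-axis step std of Brownian motion
    captured = 0
    for _molecule in range(n_molecules):
        x = y = 0.0
        hit = False
        for _step in range(steps):
            x += rng.gauss(0.0, sigma)
            y += rng.gauss(0.0, sigma)
            # Reflect off the channel walls at y = +/- half_width.
            if y > half_width:
                y = 2 * half_width - y
            elif y < -half_width:
                y = -2 * half_width - y
            # Receiver is a disc centred at (distance, 0).
            if math.hypot(x - distance, y) <= receiver_radius:
                hit = True
        captured += hit
    return captured / n_molecules


for radius in (0.1, 0.2, 0.3):
    print(radius, captured_fraction(200, radius, seed=1))
```

Every trajectory is simulated for the full number of steps, so runs with the same seed follow identical paths and a larger receiver can only capture the same molecules or more. Increasing `n_molecules` reduces the sampling noise in the captured fraction, in line with the smoother reception trends reported for higher release quantities.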

2606.17332 2026-06-17 eess.SP 新提交

Self-Calibrated Indoor Tracking from Backscatter Fiducials under NLOS Transmitter Illumination

非视距发射器照明下基于反向散射标记的自校准室内跟踪

Hüseyin Yiğitler, Kalle Ruttik, Jingyi Liao, Alexander Sheverdyaev, Riku Jäntti

AI总结 针对发射器-标记链路非视距的走廊场景,提出基于网格的惩罚似然跟踪器,联合估计接收路径、对数距离斜率和标记偏移,并利用代理校准路径进行残差校正,实现无需实测校准坐标的室内定位。

Comments 6 pages, 2 figures, 1 table, submitted to 16th International Conference on Indoor Positioning and Indoor Navigation (IPIN 2026)

详情
AI中文摘要

本文研究了在直接发射器照明之外的走廊段中,基于壁挂式反向散射标记的室内跟踪。在测量设置中,发射器到标记的链路是非视距的,而沿着走廊的标记到接收器链路主要是视距的。主要挑战在于有效标记响应依赖于部署环境,因此固定的校准链路预算不可靠。因此,我们使用基于网格的惩罚似然跟踪器,直接从接收功率中提取接收器路径、拟合的对数距离斜率参数和标记特定偏移量。得到的路径随后可重复用作残差图校正的代理校准坐标,而使用实测校准坐标的相同校正仅作为参考报告。在一个短的四标记走廊段上,无需实测校准坐标,该双频跟踪器的中位误差为0.52米,代理残差校正将其改善至0.46米。使用实测校准坐标时,相同校正和类似RADAR的指纹参考均达到0.31米。因此,主要剩余限制在于代理校准路径的质量,而非结构化观测模型本身。

英文摘要

This paper studies indoor tracking from wall-mounted backscatter fiducials in corridor segments outside direct transmitter illumination. In the measured setup, the transmitter-to-fiducial links are NLOS, whereas the fiducial-to-receiver links along the corridor are largely LOS. The main challenge is that the effective fiducial response is deployment-dependent, so a fixed calibrated link budget is not reliable. We therefore use a grid-based penalized-likelihood tracker that profiles the receiver path, a fitted log-distance slope parameter, and fiducial-specific offsets directly from received powers. The resulting paths can then be reused as surrogate calibration coordinates for residual-map correction, while the same correction with measured calibration coordinates is reported only as a reference. On a short four-fiducial corridor segment, the profiled dual-band tracker gives a 0.52 m median error without measured calibration coordinates, and surrogate residual correction improves this to 0.46 m. With measured calibration coordinates, the same correction and a RADAR-style fingerprint reference both reach 0.31 m. The main remaining limitation is therefore the quality of the surrogate calibration paths rather than the structured observation model itself.

2606.17325 2026-06-17 eess.SP 新提交

Backscatter Assisted Indoor NLOS Positioning

反向散射辅助的室内非视距定位

Kalle Ruttik, Hüseyin Yiğitler, Jingyi Liao, Alexander Sheverdyaev, Riku Jäntti

AI总结 利用被动反向散射设备作为虚拟锚点,通过非相干功率域建模和走廊约束最大后验跟踪器,实现亚米级室内非视距连续定位。

Comments 6 pages, 5 figures, accepted by IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC) 2026

详情
AI中文摘要

被动反向散射设备(BD)可以通过作为虚拟锚点来启用室内非视距(NLOS)定位,其多普勒分离特征在标准信道估计中可观测。本文研究在走廊环境中使用非相干功率域公式进行连续用户设备(UE)跟踪,该公式避免了BD相位同步,并对残余载波偏移和强多径保持鲁棒性。BD相关的测量通过具有未知BD特定偏移的对数距离定律建模,这使得无源异步设备无需发射功率校准即可用作锚点。基于该模型,我们开发了一个具有运动正则化和Huber鲁棒估计的走廊约束最大后验(MAP)跟踪器。在射线追踪启发的仿真中,该方法实现了0.23–0.27米的中位定位误差,90百分位误差低于0.45米。在办公室走廊测量中,使用四个频率为866 MHz的无源BD,该方法达到了0.505米的聚合中位误差,并优于简单的加权平均基线。结果表明,无源异步BD可以提供实用的亚米级室内NLOS跟踪,同时保持与现有信道估计流水线和能量自主BD部署的兼容性。

英文摘要

Passive backscatter devices (BDs) can enable indoor non-line-of-sight (NLOS) positioning by serving as virtual anchors whose Doppler-separated signatures are observable in standard channel estimates. This paper studies continuous user-equipment (UE) tracking in corridor environments using a noncoherent power-domain formulation that avoids BD phase synchronization and remains robust to residual carrier offsets and strong multipath. The BD-dependent measurements are modeled by a log-distance law with unknown BD-specific offsets, which allows passive asynchronous devices to be used as anchors without transmit-power calibration. Based on this model, we develop a corridor-constrained maximum a posteriori (MAP) tracker with motion regularization and Huber-robust estimation. In ray-tracing-inspired simulations, the method achieves median positioning errors of 0.23–0.27 m with 90th-percentile errors below 0.45 m. In office-corridor measurements with four passive BDs at 866 MHz, it attains an aggregated median error of 0.505 m and outperforms a simple weighted-average baseline. The results show that passive asynchronous BDs can provide practical sub-meter indoor NLOS tracking while remaining compatible with existing channel-estimation pipelines and energy-autonomous BD deployments.
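
The log-distance law with an unknown per-device offset can be written as P = b − 10·n·log10(d), where n is the path-loss slope and b absorbs the uncalibrated device response. The sketch below fits n and b for a single device by ordinary least squares from known distances. This is a simplification: the paper's tracker estimates position jointly and uses Huber-robust estimation, neither of which is reproduced here, and the function name is hypothetical.

```python
import math


def fit_log_distance(distances_m, powers_db):
    """Least-squares fit of P = offset - 10 * slope * log10(d)."""
    xs = [-10 * math.log10(d) for d in distances_m]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(powers_db) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, powers_db))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return slope, offset


# Noise-free synthetic data with slope 2.5 and a -40 dB device offset.
distances = [1.0, 2.0, 4.0, 8.0]
powers = [-40.0 - 10 * 2.5 * math.log10(d) for d in distances]
print(fit_log_distance(distances, powers))
```

Treating the offset as a free parameter is what removes the need for transmit-power calibration: only the shape of the power-versus-distance curve carries position information.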

2606.17311 2026-06-17 eess.SP 新提交

Pilot-Aided MIMO Channel Identification and Linear Deconvolution in Correlated Gaussian Noise

相关高斯噪声中导频辅助的MIMO信道辨识与线性解卷积

Necati Kagan Erkek, Y. Ugur Ozcan

AI总结 针对空间相关高斯噪声下的MIMO系统,采用导频辅助的最大似然/最小二乘信道估计,并与Cramer-Rao界比较,进而利用估计信道进行数据恢复,分析训练序列长度和正则化对性能的影响。

Comments 8 pages

详情
AI中文摘要

本文提出了在空间相关高斯噪声下多输入多输出(MIMO)信道辨识和线性解卷积的导频辅助研究。分析了实值$4\times4$基带模型,包括无记忆和有限冲激响应信道。噪声过程由Toeplitz协方差矩阵生成,通过最大似然/最小二乘公式从导频符号估计信道,并将经验均方误差与Cramer-Rao界进行比较。然后,利用估计的信道通过最大似然迫零和线性最小均方误差解卷积进行数据符号恢复。结果表明,足够长且条件良好的导频块使信道估计器接近理论下界,而短训练间隔会导致秩和条件限制,特别是对于四抽头模型。解卷积实验进一步表明,在低信噪比和信道估计不准确的情况下,MMSE正则化提供了比非正则化迫零更稳定的逆。

英文摘要

This paper presents a pilot-aided study of multiple-input multiple-output (MIMO) channel identification and linear deconvolution under spatially correlated Gaussian noise. A real-valued $4\times4$ baseband model is analyzed for both memoryless and finite-impulse-response channels. The noise process is generated from a Toeplitz covariance matrix, the channel is estimated from pilot symbols through maximum-likelihood/least-squares formulations, and the empirical mean-square error is compared with the Cramér–Rao bound. The estimated channel is then used for data-symbol recovery through maximum-likelihood zero-forcing and linear minimum-mean-square-error deconvolution. The results show that sufficiently long and well-conditioned pilot blocks allow the channel estimator to approach the theoretical lower bound, whereas short training intervals cause rank and conditioning limitations, especially for the four-tap model. The deconvolution experiments further show that MMSE regularization provides a more stable inverse than unregularized zero forcing at low signal-to-noise ratios and for inaccurate channel estimates.
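
For the memoryless case, the pipeline above reduces to three small matrix operations. The sketch below assumes the real-valued model Y = HX + N with a known pilot block X, uses NumPy, and treats the noise as white in the MMSE step for simplicity; the paper's handling of the Toeplitz noise covariance and of FIR channels is not reproduced. Function names and test matrices are illustrative.

```python
import numpy as np


def ls_channel_estimate(Y, X):
    """Least-squares estimate of H from pilots: H = Y X^T (X X^T)^-1."""
    return Y @ X.T @ np.linalg.inv(X @ X.T)


def zf_deconvolve(H, Y):
    """Zero-forcing recovery: apply the (pseudo-)inverse of the channel."""
    return np.linalg.pinv(H) @ Y


def mmse_deconvolve(H, Y, noise_var):
    """Regularized inverse, assuming unit-power symbols and white noise."""
    n_tx = H.shape[1]
    return np.linalg.solve(H.T @ H + noise_var * np.eye(n_tx), H.T @ Y)


# Orthogonal 4x4 pilot block (Hadamard), so X X^T = 4 I is well conditioned.
X = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
H = np.eye(4) + 0.1 * np.arange(16, dtype=float).reshape(4, 4)

rng = np.random.default_rng(0)
Y = H @ X + 0.05 * rng.standard_normal((4, 4))   # noisy pilot observation
H_hat = ls_channel_estimate(Y, X)
print(np.abs(H_hat - H).max())                   # channel estimation error
print(mmse_deconvolve(H_hat, Y, noise_var=0.05 ** 2).round(2))
```

The term `noise_var * np.eye(n_tx)` keeps the matrix being inverted away from singularity, which is the stabilizing effect the abstract attributes to MMSE regularization when the channel estimate is poor or the SNR is low.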

2606.17306 2026-06-17 eess.SP 新提交

Robust Beamforming Design for Secure Uplink NOMA-ISAC

安全上行NOMA-ISAC的鲁棒波束赋形设计

Azadeh Tabeshnezhad, Milad Tatar Mamaghani, A. Lee Swindlehurst, Tommy Svensson, Erik Ström

AI总结 针对上行NOMA-ISAC系统中窃听者位置不确定的安全问题,提出联合优化用户和速率与感知性能的鲁棒波束赋形方案,通过交替优化算法实现快速收敛。

详情
AI中文摘要

集成感知与通信是第六代(6G)移动网络的关键技术,能够在统一系统中联合使用通信与雷达感知。虽然ISAC在频谱效率方面带来显著优势,但也引入了新的安全挑战。特别是,感知与通信资源的联合使用可能增加窃听和信息泄露的脆弱性。本文研究一个上行非正交多址(NOMA)系统,其中基站(BS)同时接收用户数据并感知位置不确定的潜在窃听者(Eve)。为增强物理层安全,设计鲁棒感知信号以同时感知和干扰Eve。我们制定了一个联合优化问题,旨在最大化用户和速率与BS感知性能,同时保持对Eve的安全性。由于所得优化问题非凸,我们开发了一种迭代交替优化(AO)算法,将其分解为两个易处理的子问题。在第一个子问题中,利用广义特征值分解以闭式优化接收合并向量。在第二个子问题中,通过半定松弛(SDR)和逐次凸近似(SCA)联合优化发射波束赋形矩阵和感知功率。仿真结果证明了我们方案在快速收敛和资源分配方面的有效性。

英文摘要

Integrated sensing and communication is an important technology for sixth-generation (6G) mobile networks, enabling the joint use of communication and radar sensing within a unified system. While offering significant benefits in terms of spectral efficiency, ISAC introduces new security challenges. In particular, the joint use of resources for sensing and communication can increase vulnerability to eavesdropping and information leakage. In this paper, we study an uplink Non-Orthogonal Multiple Access (NOMA) system where the base station (BS) simultaneously receives user data and senses a potential eavesdropper (Eve) with uncertain location. To enhance the physical-layer security, a robust sensing signal is designed to both sense and jam Eve. We formulate a joint optimization problem that aims to maximize the users' sum rate and the BS sensing performance while maintaining security against Eve. Since the resulting optimization problem is non-convex, we develop an iterative alternating optimization (AO) algorithm that decomposes it into two tractable subproblems. In the first subproblem, the receive combining vectors are optimized in closed form using generalized eigenvalue decomposition. In the second subproblem, the transmit beamforming matrices and sensing power are jointly optimized via semidefinite relaxation (SDR) and successive convex approximation (SCA). Simulation results demonstrate the effectiveness of our solution in terms of fast convergence and resource allocation.

2606.17263 2026-06-17 eess.AS 新提交

Direction of arrival estimation from distant microphone data using single frequency filtering

基于单频滤波的远距离麦克风数据波达方向估计

Sushmita Thakallapalli, Sudarsana Reddy Kadiri, Nilesh Madhu, Suryakanth V Gangashetty

AI总结 针对窄带波达方向估计易受空间混叠影响的问题,提出基于单频滤波的语音存在时频区域互相关方法,在多种混响和噪声条件下优于现有窄带方法及部分宽带方法。

详情
AI中文摘要

在远距离麦克风中,宽带(BB)波达方向(DoA)估计方法比窄带(NB)方法更适用。由于优化函数在所有频带上的聚合,BB估计器对空间混叠具有鲁棒性,而空间混叠是处理远距离麦克风数据时的一个已知问题。在NB方法中,DoA估计利用每个频带中的局部信息,因此估计受空间混叠影响。然而,与BB方法不同,NB方法利用频率稀疏性在单个时间帧内估计多个说话者的DoA。本文开发了一种提高NB DoA估计器对空间混叠鲁棒性的方法。所提方法基于对麦克风信号进行单频滤波(SFF)获得的语音存在时频区域的互相关。选择SFF谱是因为SFF分量在时间和频率上都具有高信噪比区域,并且语音与非语音的区分在SFF域中对退化具有鲁棒性。在模拟和真实数据上,使用检测和准确度指标,在不同混响和噪声条件下,将所提NB估计器与四种最先进估计器(一个NB和三个BB)进行比较。结果表明,在所有环境中,基于SFF的NB方法优于最先进的NB方法。此外,基于SFF的方法的性能优于某些BB估计器。

英文摘要

In distant microphones, broadband (BB) methods for direction-of-arrival (DoA) estimation are more suitable than narrowband (NB) methods. Due to the aggregation of their optimization function across all frequency bands, BB estimators are robust to spatial aliasing, a known problem in processing distant microphone data. In NB methods, DoA estimation is performed by utilizing local information in each frequency band and hence the estimation is affected by spatial aliasing. However, unlike BB methods, NB methods exploit frequency sparsity to estimate the DoAs of multiple speakers in a single time frame. In this article, a method to improve the robustness of a NB DoA estimator to spatial aliasing is developed. The proposed method is based on cross-correlation of speech-present time-frequency regions obtained by single frequency filtering (SFF) of the microphone signals. The SFF spectrum is chosen because SFF components have regions of high signal-to-noise ratio both in time and frequency and because speech and non-speech discrimination is robust to degradations in the SFF domain. The proposed NB estimator is compared to four state-of-the-art estimators (one NB and three BB) using detection and accuracy metrics on simulated and real-world data in different reverberation and noise conditions. The results show that in all the environments, the SFF-based NB approach outperforms the state-of-the-art NB approach. Furthermore, the performance of the SFF-based approach is better than some of the BB estimators.

2606.17258 2026-06-17 eess.AS 新提交

Single frequency filtering based multi-speaker direction of arrival estimation from stereo recordings

基于单频滤波的立体录音多说话人到达方向估计

Sushmita Thakallapalli, Sudarsana Reddy Kadiri, Nilesh Madhu, Suryakanth V Gangashetty

AI总结 提出一种基于单频滤波的到达方向估计方法,利用PHAT加权互相关处理SFF输出包络,在混响、多说话人和噪声条件下优于或媲美最佳GCC方法。

详情
AI中文摘要

从嘈杂和混响的麦克风信号中进行鲁棒的到达方向(DoA)估计仍然具有挑战性。传统的估计器如广义互相关(GCC)及其变体在短时傅里叶变换(STFT)域中操作,其中频谱特征主要反映声道特性。最近的基于单频滤波(SFF)的估计器则使用时频表示,该表示提供谐波的高频谱分辨率以及激励源事件(如类脉冲)的高时间分辨率。由于激励源特征已被证明比频谱特征对噪声和混响更鲁棒,本文提出了一种改进的基于SFF的DoA估计器,该估计器使用PHAT加权的GCC来关联麦克风通道之间的SFF输出包络。我们进一步使用公开的真实房间录音,在具有挑战性的混响、多说话人和噪声条件下,对基于SFF和最先进的基于GCC的估计器进行了全面评估。实验结果表明,所提出的方法和现有的基于SFF的估计器在所有测试案例中实现了优于或可媲美最佳基于GCC的估计器的检测和精度性能。我们还证明,使用语音主导的频带可以提高GCC-PHAT的鲁棒性,这激励了未来将此类加权策略纳入基于SFF的DoA估计中。

英文摘要

Robust direction-of-arrival (DoA) estimation from noisy and reverberant microphone signals remains challenging. Conventional estimators such as generalized cross-correlation (GCC) and its variants operate in the short-time Fourier transform (STFT) domain, where spectral features primarily reflect vocal-tract characteristics. Recent single frequency filtering (SFF)-based estimators instead use a time-frequency representation that provides high spectral resolution of harmonics along with high temporal resolution of excitation-source events, such as epoch-like impulses. Since excitation-source features have been shown to be more robust to noise and reverberation than spectral features, this work proposes an improved SFF-based DoA estimator that correlates the envelopes of SFF outputs across microphone channels using PHAT-weighted GCC. We further provide a comprehensive evaluation of SFF-based and state-of-the-art GCC-based estimators using publicly available real-room recordings under challenging reverberant, multi-speaker, and noise-corrupted conditions. Experimental results show that the proposed method and an existing SFF-based estimator achieve detection and accuracy performance that is superior or comparable to the best GCC-based estimator across all test cases. We also demonstrate that using speech-dominant bins improves GCC-PHAT robustness, motivating future incorporation of such weighting strategies into SFF-based DoA estimation.
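
For reference, the baseline named above, GCC with PHAT weighting, can be sketched in a few lines of NumPy. This version operates on raw two-channel signals; the paper instead correlates envelopes of SFF outputs, which is not reproduced here. The far-field conversion from delay to angle, the 343 m/s speed of sound, and the function names are assumptions for illustration.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_delay_s=None):
    """Estimate the delay of `sig` relative to `ref` in seconds via GCC-PHAT."""
    n = len(sig) + len(ref)                    # zero-pad to avoid circular wrap
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12             # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_delay_s is not None:
        max_shift = max(1, min(int(fs * max_delay_s), max_shift))
    # Reorder so that index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def delay_to_doa_deg(delay_s, mic_distance_m, speed_of_sound=343.0):
    """Far-field angle from broadside for a two-microphone pair."""
    s = np.clip(speed_of_sound * delay_s / mic_distance_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))


fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))  # ref delayed by 5 samples
tau = gcc_phat_delay(sig, ref, fs)
print(tau * fs, delay_to_doa_deg(tau, mic_distance_m=0.2))
```

Discarding the cross-spectrum magnitude sharpens the correlation peak, but it also gives noise-dominated bins the same weight as speech-dominated ones, which is why restricting the sum to speech-dominant bins can help.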

2606.17254 2026-06-17 eess.AS 新提交

Synergizing Zero-Shot Cross-Lingual Alzheimer Detection with Language-Invariant Multimodal Bi-Geometric Adversarial Learning

协同零样本跨语言阿尔茨海默检测与语言不变多模态双几何对抗学习

Girish, Mohd Mujtaba Akhtar, Farhan Sheth, Muskaan Singh, Juliana Gerard, Paula McClean, Kongfatt Wong-Lin

AI总结 提出ORBIT框架,通过跨注意力融合、多语言对抗和球面-双曲几何学习实现零样本跨语言阿尔茨海默病检测,多模态融合优于单模态基线。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

在这项工作中,我们研究了基于语音的零样本跨语言阿尔茨海默病检测(SADD)。我们假设,通过融合多语言语音和文本预训练模型来学习语言不变的多模态表示,对于可靠地迁移到未见过的语言至关重要,因为这两种模态捕捉了认知障碍的互补声学和语言标记,而对抗学习抑制了语言特定的混淆因素。零样本跨语言评估的实验结果证实了这一假设,表明多模态融合始终优于单模态基线。为此,我们提出了ORBIT,一个新颖的框架,它结合了跨注意力融合、多语言对抗器以及互补的球面-双曲几何学习与共识聚类。在各种设置下,与单模态模型和基于简单拼接的融合基线相比,ORBIT实现了最强的性能。

英文摘要

In this work, we study zero-shot cross-lingual speech-based Alzheimer's disease detection (SADD). We hypothesize that learning language-invariant multimodal representations by fusing multilingual speech and text pretrained models is essential for reliable transfer to unseen languages, as the two modalities capture complementary acoustic and linguistic markers of cognitive impairment while adversarial learning suppresses language-specific confounds. Empirical results in zero-shot cross-lingual evaluation substantiate the hypothesis, showing that multimodal fusion consistently outperforms unimodal baselines. To this end, we propose ORBIT, a novel framework that combines cross-attentive fusion, multi-tap language adversaries, and complementary spherical–hyperbolic geometric learning with consensus clustering. Across settings, ORBIT achieves the strongest performance compared to unimodal models and simple concatenation-based fusion baselines.