Speaker diarization, the task of segmenting an audio recording and attributing each segment to a speaker, is critical for applications such as automatic speech recognition, transcription, and the analysis of multi-speaker recordings such as meetings and podcasts. Overlapping speech, prevalent in datasets such as AMI and CALLHOME, poses a significant challenge because it complicates accurate speaker segmentation. This work addresses the issue by investigating the linearity of biometric embeddings: the property that the embedding of overlapping speech can be represented as a linear combination of the individual speaker embeddings. This property is essential for robust diarization, particularly in cascaded schemes built on Target-Speaker Voice Activity Detection (TSVAD). We propose a novel fine-tuning method for the ECAPA-TDNN model that enhances embedding linearity, using a synthetic dataset derived from VoxCeleb and a modified loss function combining AAM-Softmax with a linearity term. Integrated into a cascaded TSVAD-based diarization framework, the approach supports both full-context and streaming modes. Experiments on standard benchmarks (AMI, DIHARD, VoxConverse) demonstrate a reduced Diarization Error Rate (DER) compared to state-of-the-art methods, indicating improved handling of overlapping speech. The proposed method fills a gap in optimizing embedding linearity and offers practical benefits for real-world multi-speaker scenarios.
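The linearity property at the core of the abstract can be illustrated with a minimal sketch. The formulation below is hypothetical (the paper's exact linearity term is not given here): it measures how far the embedding of an overlapped segment deviates from a weighted combination of the individual speaker embeddings, using cosine distance. Such a term would be added to the AAM-Softmax classification objective during fine-tuning.

```python
import numpy as np

def linearity_loss(mix_emb, spk_embs, weights):
    """Cosine distance between the embedding of overlapping speech and
    the weighted sum of individual speaker embeddings.

    mix_emb  : (d,)   embedding of the mixed (overlapped) segment
    spk_embs : (k, d) embeddings of the k individual speakers
    weights  : (k,)   mixing weights (e.g. relative energies)

    NOTE: hypothetical illustration of a 'linearity term'; the paper's
    actual loss formulation may differ.
    """
    target = np.sum(weights[:, None] * spk_embs, axis=0)
    cos = np.dot(mix_emb, target) / (
        np.linalg.norm(mix_emb) * np.linalg.norm(target)
    )
    return 1.0 - cos

# Toy example: two orthogonal unit speaker embeddings.
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
w = np.array([0.5, 0.5])

# A perfectly linear mixture embedding yields (near-)zero loss.
mix = 0.5 * e1 + 0.5 * e2
print(linearity_loss(mix, np.stack([e1, e2]), w))
```

A fine-tuning objective in this spirit would be a weighted sum, e.g. `L = L_aam + lambda * linearity_loss(...)`, where `lambda` balances speaker-discrimination against linearity; `lambda` here is an assumed hyperparameter name, not one taken from the paper.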