

Whisper Diarization
Overview
whisper-diarization is an open-source project that combines Whisper's automatic speech recognition (ASR) with Voice Activity Detection (VAD) and speaker-embedding technology. The pipeline first extracts the audible portions of the audio to improve speaker-embedding accuracy, transcribes them with Whisper, and then corrects and aligns the timestamps with WhisperX to minimize segmentation errors caused by temporal offsets. MarbleNet then performs VAD and segmentation to eliminate silence, and TitaNet extracts speaker embeddings that identify the speaker of each segment. Finally, these results are correlated with the timestamps produced by WhisperX to determine the speaker of each word, and a punctuation model realigns the output to compensate for minor timing offsets.
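The final matching step lends itself to a short illustration. The sketch below is a minimal, self-contained example of assigning each word to the speaker whose turn overlaps it most; the function name and data shapes are assumptions made for illustration, not the project's actual API.

```python
# Hypothetical sketch of the word-to-speaker matching step described above.
# words: (text, start_sec, end_sec) tuples from the aligner;
# turns: (speaker_label, start_sec, end_sec) tuples from the diarizer.
def assign_speakers(words, turns):
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            # Length of the time interval shared by the word and the turn.
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Words overlapping no turn keep the label None.
        labeled.append((best_speaker, text))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
turns = [("Speaker 0", 0.0, 1.0), ("Speaker 1", 1.0, 2.0)]
print(assign_speakers(words, turns))
# [('Speaker 0', 'hello'), ('Speaker 0', 'there'), ('Speaker 1', 'hi')]
```

In the project itself, a punctuation model then realigns these labels, which reduces mislabeling of words whose timestamps drift slightly across a speaker boundary.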
Target Users
This project is aimed at developers and researchers who need automatic speech recognition combined with speaker segmentation, particularly for multi-speaker audio files, where it significantly improves transcription and segmentation accuracy.
Use Cases
Researchers use whisper-diarization for automatic transcription and speaker recognition of academic conference audio.
Developers leverage this project to add auto-generated subtitles and speaker labels to video conferencing software.
Content creators employ whisper-diarization to enhance the post-production efficiency of podcasts or video content.
Features
High-quality speech transcription powered by Whisper ASR
Silence removal via Voice Activity Detection (VAD); see the sketch after this list
Speaker identification via speaker embeddings
Timestamp correction and alignment via WhisperX
Transcription alignment accuracy refined with a punctuation model
Batch inference support for higher processing throughput
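Because several features above revolve around VAD, here is a small runnable sketch of the general idea behind silence removal: threshold frame-level speech probabilities and merge adjacent speech frames into segments. The frame length and probabilities are made-up inputs for the sketch; in the actual project, MarbleNet produces the speech/non-speech decisions.

```python
FRAME_SEC = 0.02  # assumed 20 ms analysis frames (illustrative value)

def speech_segments(probs, threshold=0.5):
    """Turn per-frame speech probabilities into (start_sec, end_sec) spans."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * FRAME_SEC                     # speech begins
        elif p < threshold and start is not None:
            segments.append((start, i * FRAME_SEC))   # speech ends
            start = None
    if start is not None:                             # audio ends mid-speech
        segments.append((start, len(probs) * FRAME_SEC))
    return segments

probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.05, 0.7, 0.9, 0.2]
print(speech_segments(probs))  # [(0.04, 0.1), (0.14, 0.18)]
```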
How to Use
1. Ensure that the prerequisites FFmpeg and Cython are installed.
2. Clone or download the whisper-diarization repository.
3. Modify `diarize.py` and `helpers.py` to adjust the WhisperX and NeMo parameters as needed.
4. Run the model from the command line, passing the audio file name and any relevant parameters, as shown in the sketch after this list.
5. Based on your system's VRAM capacity, choose either `diarize.py` or `diarize_parallel.py` for processing.
6. Review the output to ensure the accuracy of the transcription and speaker segmentation.
7. If you encounter issues or have suggestions for improvement, feel free to submit an issue or a pull request on GitHub.
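As a concrete starting point, the sketch below shells out to `diarize.py` from Python after cloning the repository. The `-a` and `--whisper-model` flags follow the repository's README at the time of writing; treat the exact flag names and the input file name as assumptions and confirm them with `python diarize.py --help`.

```python
import subprocess

audio_file = "meeting.wav"  # hypothetical input file in the repo directory

# Run the sequential pipeline; flag names are assumptions based on the
# README at the time of writing -- verify with `python diarize.py --help`.
subprocess.run(
    ["python", "diarize.py", "-a", audio_file, "--whisper-model", "medium.en"],
    check=True,
)
# Per step 5, you can swap in "diarize_parallel.py" here depending on
# available VRAM; the argument style is the same.
```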