

Whisper Diarization
Overview
whisper-diarization is an open-source project that combines Whisper's automatic speech recognition (ASR) with Voice Activity Detection (VAD) and speaker-embedding technology. The pipeline first extracts the audible portions of the audio to improve speaker-embedding accuracy, transcribes them with Whisper, and then corrects and aligns the timestamps with WhisperX to minimize segmentation errors caused by temporal offsets. MarbleNet then performs VAD and segmentation to eliminate silence, and TitaNet extracts speaker embeddings that identify the speaker of each segment. Finally, these results are correlated with the timestamps produced by WhisperX to determine the speaker of each word, and a punctuation model realigns the output to compensate for minor timing offsets.
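The final matching step lends itself to a short illustration. The sketch below is a minimal, self-contained example of assigning each word to the speaker whose turn overlaps it most; the function name and data shapes are assumptions made for illustration, not the project's actual API.

```python
# Hypothetical sketch of the word-to-speaker matching step described above.
# words: (text, start_sec, end_sec) tuples from the aligner;
# turns: (speaker_label, start_sec, end_sec) tuples from the diarizer.
def assign_speakers(words, turns):
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            # Length of the time interval shared by the word and the turn.
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Words overlapping no turn keep the label None.
        labeled.append((best_speaker, text))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
turns = [("Speaker 0", 0.0, 1.0), ("Speaker 1", 1.0, 2.0)]
print(assign_speakers(words, turns))
# [('Speaker 0', 'hello'), ('Speaker 0', 'there'), ('Speaker 1', 'hi')]
```

In the project itself, a punctuation model then realigns these labels, which reduces mislabeling of words whose timestamps drift slightly across a speaker boundary.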
Target Users
This project is aimed at developers and researchers who need automatic speech recognition combined with speaker segmentation, particularly for multi-speaker audio files, where it significantly improves transcription and segmentation accuracy.
Use Cases
Researchers use whisper-diarization for automatic transcription and speaker recognition of academic conference audio.
Developers leverage this project to add auto-generated subtitles and speaker labels to video conferencing software.
Content creators employ whisper-diarization to enhance the post-production efficiency of podcasts or video content.
Features
High-quality speech transcription powered by Whisper ASR
Silence removal via Voice Activity Detection (VAD); see the sketch after this list
Speaker identification via speaker embeddings
Timestamp correction and alignment via WhisperX
Transcription alignment accuracy refined with a punctuation model
Batch inference support for higher processing throughput
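Because several features above revolve around VAD, here is a small runnable sketch of the general idea behind silence removal: threshold frame-level speech probabilities and merge adjacent speech frames into segments. The frame length and probabilities are made-up inputs for the sketch; in the actual project, MarbleNet produces the speech/non-speech decisions.

```python
FRAME_SEC = 0.02  # assumed 20 ms analysis frames (illustrative value)

def speech_segments(probs, threshold=0.5):
    """Turn per-frame speech probabilities into (start_sec, end_sec) spans."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * FRAME_SEC                     # speech begins
        elif p < threshold and start is not None:
            segments.append((start, i * FRAME_SEC))   # speech ends
            start = None
    if start is not None:                             # audio ends mid-speech
        segments.append((start, len(probs) * FRAME_SEC))
    return segments

probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.1, 0.05, 0.7, 0.9, 0.2]
print(speech_segments(probs))  # [(0.04, 0.1), (0.14, 0.18)]
```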
How to Use
1. Ensure that the prerequisites FFmpeg and Cython are installed.
2. Clone or download the whisper-diarization repository.
3. Modify `diarize.py` and `helpers.py` to adjust the WhisperX and NeMo parameters as needed.
4. Run the model from the command line, passing the audio file name and any relevant parameters, as shown in the sketch after this list.
5. Based on your system's VRAM capacity, choose either `diarize.py` or `diarize_parallel.py` for processing.
6. Review the output to ensure the accuracy of the transcription and speaker segmentation.
7. If you encounter issues or have suggestions for improvement, feel free to submit an issue or a pull request on GitHub.
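As a concrete starting point, the sketch below shells out to `diarize.py` from Python after cloning the repository. The `-a` and `--whisper-model` flags follow the repository's README at the time of writing; treat the exact flag names and the input file name as assumptions and confirm them with `python diarize.py --help`.

```python
import subprocess

audio_file = "meeting.wav"  # hypothetical input file in the repo directory

# Run the sequential pipeline; flag names are assumptions based on the
# README at the time of writing -- verify with `python diarize.py --help`.
subprocess.run(
    ["python", "diarize.py", "-a", audio_file, "--whisper-model", "medium.en"],
    check=True,
)
# Per step 5, you can swap in "diarize_parallel.py" here depending on
# available VRAM; the argument style is the same.
```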