

LookOnceToHear
Overview:
LookOnceToHear is a smart-earphone interaction system that lets users choose which speaker they want to hear simply by looking at that person. The work was nominated for Best Paper at CHI 2024. Its training data is produced by synthetically mixing audio sources and spatializing them with head-related transfer functions (HRTFs) and binaural room impulse responses (BRIRs), and the resulting model extracts the target speaker's voice in real time, giving users a new way to interact with their earphones.
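The spatial-mixing idea can be sketched in a few lines: each mono source is convolved with a left/right pair of impulse responses (standing in for an HRTF or BRIR measured for one direction), and the spatialized sources are summed into a two-channel mixture. This is a minimal pure-Python illustration, not the project's data pipeline; the helper names and the toy impulse responses are hypothetical, and a real pipeline would load measured HRTF/BRIR sets and use FFT-based convolution for speed.

```python
from typing import List, Sequence, Tuple

Binaural = Tuple[List[float], List[float]]  # (left channel, right channel)


def convolve(signal: Sequence[float], ir: Sequence[float]) -> List[float]:
    """Full linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def spatialize(mono: Sequence[float], ir_left: Sequence[float],
               ir_right: Sequence[float]) -> Binaural:
    """Place a mono source using a left/right impulse-response pair
    (hypothetical helper; stands in for an HRTF or BRIR lookup)."""
    return convolve(mono, ir_left), convolve(mono, ir_right)


def mix(sources: Sequence[Binaural]) -> Binaural:
    """Sum binaural sources channel by channel, zero-padding shorter ones."""
    n = max(len(channel) for src in sources for channel in src)
    left, right = [0.0] * n, [0.0] * n
    for src_left, src_right in sources:
        for i, v in enumerate(src_left):
            left[i] += v
        for i, v in enumerate(src_right):
            right[i] += v
    return left, right


# Toy scene: the target is straight ahead (identical IRs, so it reaches both
# ears at the same time); the interferer is to the right (it reaches the left
# ear one sample later and at half the gain).
target = spatialize([1.0, 0.5], ir_left=[1.0, 0.0], ir_right=[1.0, 0.0])
interferer = spatialize([0.5, 0.5], ir_left=[0.0, 0.5], ir_right=[1.0, 0.0])
left, right = mix([target, interferer])
print(left)   # [1.0, 0.75, 0.25]
print(right)  # [1.5, 1.0, 0.0]
```

Keeping the target's left and right channels identical mirrors the listening situation the product relies on: a speaker the user is facing reaches both ears at essentially the same moment, while off-axis sources do not.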
Target Users:
This product is suited to researchers and developers who need to perform speech recognition and extraction in noisy environments. For example, it can help people with hearing impairments follow conversations in noisy settings, or support speech analysis and processing when several people are talking at once.
Use Cases
In meetings, use LookOnceToHear to select and listen to the voice of a specific speaker
Help people with hearing impairments concentrate on conversations in noisy public places
In audio analysis research, distinguish and extract individual sound sources from a mixture
Features
Users select the desired voice by looking at the target speaker for a few seconds
Uses the Scaper toolkit to synthesize audio mixtures
Provides a self-contained dataset and the .jams specification files used to generate the training data
Supports real-time speech extraction and evaluation of target speech listening models
Offers model checkpoints that make training and evaluation easier for users
Suitable for speech recognition and extraction in noisy environments
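Speech-extraction models are commonly scored with the scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed input mixture (SI-SNRi). The sketch below assumes that metric purely for illustration; the source does not say which metrics the project's evaluation reports, and the function names and toy signals are hypothetical.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def si_snr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB) of an estimate vs. a reference."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and equally long")
    # Remove each signal's mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference; the overall gain is ignored.
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]
    return 10 * math.log10((_dot(target, target) + eps)
                           / (_dot(residual, residual) + eps))


def si_snr_improvement(estimate: Sequence[float], mixture: Sequence[float],
                       reference: Sequence[float]) -> float:
    """SI-SNRi: how much the model output improves on the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


reference = [1.0, -1.0, 1.0, -1.0]   # clean target speech (toy)
interferer = [1.0, 1.0, -1.0, -1.0]  # competing speaker, orthogonal to target
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]  # 10% leakage
print(round(si_snr_improvement(estimate, mixture, reference), 2))  # 20.0
```

Because the projection step discards gain, an output that is simply louder or quieter than the reference is not penalized; only leakage from other sources and distortion lower the score.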
How to Use
Download and unzip the provided .zip file to the 'data/' directory
Run the training command to start the training process
Use Scaper's 'generate_from_jams' function to generate audio mixtures from the .jams specification files
Download and load the target speech listening model checkpoint for evaluation
Adjust model parameters as needed to optimize performance
In practical applications, users simply need to look at the target speaker to start speech extraction
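The mixture-generation step above can be scripted as a batch loop. This is a minimal sketch, assuming Scaper is installed and the .jams files sit in a folder under 'data/'; the folder names `data/jams` and `data/mixtures` and the helpers `output_path_for` and `generate_all` are hypothetical. `fg_path` and `bg_path` are Scaper's optional overrides for the foreground and background audio folders recorded inside each .jams file, which matters if the dataset was unzipped somewhere other than where the specs expect. The project may ship its own generation script; this only shows the plain Scaper call.

```python
from pathlib import Path
from typing import Optional


def output_path_for(jams_path: Path, jams_dir: Path, out_dir: Path) -> Path:
    """Map a .jams spec to a .wav path under out_dir, mirroring sub-folders."""
    return out_dir / jams_path.relative_to(jams_dir).with_suffix(".wav")


def generate_all(jams_dir: Path, out_dir: Path,
                 fg_path: Optional[str] = None,
                 bg_path: Optional[str] = None) -> int:
    """Render every .jams spec under jams_dir to audio; returns the count."""
    import scaper  # imported lazily so the path helper works without Scaper

    count = 0
    for jams_path in sorted(jams_dir.rglob("*.jams")):
        wav_path = output_path_for(jams_path, jams_dir, out_dir)
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        # Re-synthesize the mixture exactly as described by the spec file.
        scaper.generate_from_jams(str(jams_path), audio_outfile=str(wav_path),
                                  fg_path=fg_path, bg_path=bg_path)
        count += 1
    return count


# Example call with hypothetical paths:
# generate_all(Path("data/jams"), Path("data/mixtures"))
```

Mirroring the spec folder layout in the output folder keeps train, validation, and test splits separated without any extra bookkeeping.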
Featured AI Tools

OpenVoice
OpenVoice is an open-source voice cloning technology that can accurately replicate the voice in a reference recording and generate speech in various languages and accents. It offers flexible control over voice style, such as emotion and accent, as well as rhythm, pauses, and intonation. It achieves zero-shot cross-lingual voice cloning, meaning that neither the language of the generated speech nor the language of the reference speech needs to be present in the training data.
AI speech recognition

Azure AI Studio Speech Services
Azure AI Studio is Microsoft Azure's suite of artificial intelligence services, which includes speech services. These cover functions such as speech recognition, text-to-speech, and speech translation, enabling developers to add voice capabilities to their applications.
AI speech recognition