StreamVoice
Overview :
StreamVoice is a language model-based zero-shot voice conversion model that performs real-time conversion without requiring the complete source speech. It combines a fully causal, context-aware language model with a temporal-independent acoustic predictor and alternately processes semantic and acoustic features at each autoregressive time step, which removes the dependency on complete source speech. To mitigate the performance degradation that incomplete context can cause in streaming, StreamVoice uses two strategies to strengthen the language model's context awareness: 1) teacher-guided context foreseeing, in which a teacher model summarizes the current and future semantic context during training and guides the model to forecast the missing context; 2) a semantic masking strategy, which encourages acoustic prediction from preceding corrupted semantic and acoustic input and so improves context learning. Notably, StreamVoice is the first language model-based streaming zero-shot voice conversion model that requires no future look-ahead. Experimental results show that StreamVoice supports streaming conversion while maintaining zero-shot performance comparable to non-streaming voice conversion systems.
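The alternating processing described above can be illustrated with a minimal Python sketch. This is not StreamVoice's actual code: the names `stream_convert` and `toy_predictor`, the token tuples, and the arithmetic inside the predictor are all hypothetical stand-ins chosen only to show the control flow, in which each incoming semantic frame is appended to the history and one acoustic token is predicted from past and current frames alone.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical token representation: ("sem", id) or ("ac", id).
Token = Tuple[str, int]


def stream_convert(
    semantic_frames: Iterable[int],
    predict_acoustic: Callable[[List[Token]], int],
) -> Iterator[int]:
    """Yield one acoustic token per incoming semantic frame.

    The history alternates semantic and acoustic tokens
    (s1, a1, s2, a2, ...), so every prediction sees only past
    and current frames and never waits for future input.
    """
    history: List[Token] = []
    for sem in semantic_frames:
        history.append(("sem", sem))
        ac = predict_acoustic(history)  # causal: uses the history only
        history.append(("ac", ac))
        yield ac


def toy_predictor(history: List[Token]) -> int:
    # Placeholder for the language model plus acoustic predictor
    # (an assumption for illustration, not the real model).
    return sum(value for _, value in history) % 1024


if __name__ == "__main__":
    for acoustic_token in stream_convert([1, 2, 3], toy_predictor):
        print(acoustic_token)
```

Because `stream_convert` is a generator, the first acoustic token is available as soon as the first semantic frame arrives, which is the property that makes streaming conversion possible in this sketch.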
Target Users :
StreamVoice can be used in fields such as music production, speech synthesis, and voice conversion.
Total Visits: 29.7M
Top Region: US(17.94%)
Website Views: 76.7K
Use Cases
In music production, use StreamVoice to convert a singer's voice into different singing styles.
In speech synthesis, use StreamVoice to convert text into speech in different speaking styles.
In voice conversion, use StreamVoice to convert a speaker's voice into different speaking styles.
Features
Real-Time Zero-Shot Voice Conversion
Streaming Processing
Context-Aware Language Modeling