MinMo
Overview:
MinMo, developed by Alibaba Group's Tongyi Lab, is a multimodal large language model with approximately 8 billion parameters, built for seamless voice interaction. It is trained on 1.4 million hours of diverse speech data through successive alignment stages: speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and full-duplex interaction alignment. MinMo achieves state-of-the-art results on a range of speech understanding and generation benchmarks while preserving the capabilities of text-based large language models, and it supports full-duplex dialogue, enabling simultaneous two-way communication between user and system. MinMo also introduces a novel, streamlined voice decoder that outperforms previous models in speech generation. Its instruction-following has been strengthened so that users can control generated speech by voice command, including emotion, dialect, and speaking rate, and can have it mimic specific voices. Speech-to-text latency is approximately 100 milliseconds; theoretical full-duplex latency is around 600 milliseconds, and measured latency is around 800 milliseconds. MinMo aims to overcome the main limitations of earlier multimodal models and to deliver a more natural, fluid, and human-like voice interaction experience.
Target Users:
The target audience includes users who need efficient, natural voice interaction, such as developers of intelligent customer service systems and voice assistants, and enterprises that require voice interaction capabilities. MinMo's low latency and strong instruction-following make it well suited to applications that demand real-time responses and precise control over voice output, such as smart speakers and in-car voice systems. For researchers and developers exploring multimodal interaction and speech technology, MinMo also provides a powerful tool for experimentation and innovation.
Use Cases
Chatting with MinMo in English about movies.
Conversing with MinMo in Chinese while controlling its dialect (such as Sichuan dialect, Cantonese, etc.).
Chatting with MinMo in Chinese, instructing it to express emotions and to role-play.
Features
Achieves current state-of-the-art performance in speech dialogue, multilingual speech recognition, multilingual speech translation, emotion recognition, speaker analysis, and audio event analysis.
Supports end-to-end voice interaction, controlling the emotion, dialect, and speaking style of generated audio based on user commands, and mimics specific voices with reported generation efficiency above 90%.
Enables full-duplex voice interaction, allowing for smooth multi-turn conversations between users and the system, while preventing background noise interference. Speech-to-text latency is approximately 100 milliseconds, with theoretical full-duplex latency around 600 milliseconds and actual latency around 800 milliseconds.
Introduces a novel and simple voice decoder that surpasses previous models in speech generation.
Overcomes major limitations of earlier aligned multimodal models through multiple stages of training, including speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and full-duplex interaction alignment.
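The full-duplex behavior described above, in which the system keeps listening while it speaks and must distinguish a real interruption from a backchannel or background noise, can be sketched as a toy turn-taking state machine. This is an illustration only, not MinMo's actual duplex module:

```python
from dataclasses import dataclass, field

@dataclass
class DuplexManager:
    """Toy full-duplex turn manager: the system can be barged in on."""
    state: str = "LISTENING"                     # LISTENING or SPEAKING
    transcript: list = field(default_factory=list)

    def on_user_speech(self, text: str, is_backchannel: bool = False) -> None:
        # Backchannels ("uh-huh") should not interrupt system speech;
        # substantive user speech triggers a barge-in.
        if self.state == "SPEAKING" and not is_backchannel:
            self.state = "LISTENING"             # stop talking, yield the turn
        self.transcript.append(("user", text))

    def start_response(self, text: str) -> None:
        self.state = "SPEAKING"
        self.transcript.append(("system", text))

m = DuplexManager()
m.on_user_speech("what's the weather?")
m.start_response("It's sunny today, with a light breeze...")
m.on_user_speech("uh-huh", is_backchannel=True)
assert m.state == "SPEAKING"                     # backchannel ignored
m.on_user_speech("actually, what about tomorrow?")
assert m.state == "LISTENING"                    # barge-in interrupts
```

A real duplex model folds this decision into its prediction loop rather than a hand-written state machine, but the listening/speaking/barge-in distinction is the same.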
How to Use
1. Visit MinMo's official website or integrate it into supported applications.
2. Select the desired voice interaction mode, such as voice conversation or voice translation.
3. Speak a command or enter text input as prompted.
4. Observe MinMo's voice responses and adjust commands or parameters as needed.
5. Utilize MinMo's command control features to customize voice output in terms of emotion, dialect, and speech rate.
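Step 5 can be illustrated with a small, hypothetical prompt-building helper. MinMo has no public API described here, so the function name, parameter names, and prompt format below are assumptions for illustration only:

```python
def build_voice_instruction(text: str, emotion: str = "",
                            dialect: str = "", rate: str = "") -> str:
    """Hypothetical helper: prepend voice-control tags (emotion, dialect,
    speaking rate) to a request, mimicking instruction-controlled output."""
    controls = []
    if emotion:
        controls.append(f"emotion={emotion}")
    if dialect:
        controls.append(f"dialect={dialect}")
    if rate:
        controls.append(f"rate={rate}")
    prefix = f"[voice: {', '.join(controls)}] " if controls else ""
    return prefix + text

prompt = build_voice_instruction("Tell me a story",
                                 emotion="cheerful",
                                 dialect="Sichuanese",
                                 rate="slow")
# -> "[voice: emotion=cheerful, dialect=Sichuanese, rate=slow] Tell me a story"
```

In practice such controls would be spoken or typed directly to the model; the helper only makes the structure of an instruction-controlled request explicit.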
AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase