Ultravox V0 4 1 Llama 3 1 8b : Multimodal speech large language model

Ultravox V0 4 1 Llama 3 1 8b

#Speech Recognition #Speech Translation #Multimodal Model #Large Language Model Standard Picks Open Source

Overview :

fixie-ai/ultravox-v0_4_1-llama-3_1-8b is a large language model based on pre-trained Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing speech and text input to generate text output. The model converts input audio to embeddings using a special <|audio|> pseudo-token and generates output text. Future versions plan to expand the token vocabulary to support semantic and acoustic audio token generation, which can then be used by a vocoder to produce speech output. The model performs excellently in translation evaluation and has no preference adjustment, making it suitable for scenarios such as voice agents, speech-to-speech translation, and speech analysis.

Target Users :

The target audience is developers and enterprises who need to process speech and text data, such as professionals in speech recognition, speech translation, and speech analysis. Ultravox's multimodal processing capabilities and high performance make it an ideal choice for these fields.

Total Visits： 29.7M

Top Region： US(17.94%)

Website Views ： 45.8K

Use Cases

- As a voice agent, answer user questions.

- Perform speech-to-speech translation to assist cross-language communication.

- Analyze voice commands to execute specific tasks.

Features

- Multimodal input processing: Simultaneously processes speech and text input.

- Special token processing: Processes audio input using the <|audio|> tag.

- Text generation: Generates output text based on merged embeddings.

- Speech-to-speech translation: Suitable for speech translation between different languages.

- Speech analysis: Analyzes speech content and generates relevant text.

- Future support for acoustic audio token generation: Plans to expand functionality to support acoustic audio token generation.

- Knowledge distillation loss training: Trains the model using knowledge distillation loss to match the logits of the text-based Llama backbone network.

How to Use

1. Install necessary libraries: pip install transformers peft librosa.

2. Import libraries: import transformers, numpy as np, librosa.

3. Load the model: pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b', trust_remote_code=True).

4. Load the audio file: audio, sr = librosa.load(path, sr=16000).

5. Prepare input: Define the system role and content, build the turns list.

6. Call the model: pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30).