AnyGPT: A multi-modal large language model

AnyGPT

AI Model #Multi-modal #Chatbot #Speech recognition #Speech synthesis #Image generation Standard Picks Open Source

Overview:

AnyGPT is a unified multi-modal large language model that uses discrete representations to process speech, text, images, and music in a uniform way. It can be trained stably without changing the architecture or training paradigm of existing large language models. Instead, it relies entirely on data-level preprocessing, so a new modality can be integrated into the language model much as a new language would be. The authors built a text-centric multi-modal dataset for multi-modal alignment pre-training. Using generative models, they also synthesized what they describe as the first large-scale any-to-any multi-modal instruction dataset: 108,000 multi-turn dialogue examples in which different modalities are interleaved, which lets the model handle arbitrary combinations of input and output modalities. Experimental results indicate that AnyGPT can conduct any-to-any multi-modal dialogue while achieving performance comparable to specialized models across all modalities, which suggests that discrete representations can effectively and conveniently unify multiple modalities within a language model.
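The idea of adding a modality "like a new language" can be illustrated with a small sketch. This is not AnyGPT's actual code: the class `UnifiedVocab`, the boundary-token layout, and the toy codebook sizes are all assumptions made for illustration, and the modality-specific tokenizers that would produce the discrete codes are not shown. The sketch only demonstrates that once every modality is expressed as discrete codes, those codes can be mapped into reserved ranges of one shared vocabulary and interleaved with text in a single token sequence, leaving the language model itself unchanged.

```python
class UnifiedVocab:
    """Toy shared vocabulary: text tokens plus one reserved id range per modality.

    Hypothetical layout, for illustration only.
    """

    def __init__(self, text_tokens):
        self.text_ids = {tok: i for i, tok in enumerate(text_tokens)}
        self.size = len(text_tokens)
        # name -> (start_id, end_id, offset, codebook_size)
        self.modalities = {}

    def add_modality(self, name, codebook_size):
        # Reserve two boundary tokens, then a contiguous range for the codes.
        # Nothing else changes: the model only sees a larger vocabulary.
        start_id, end_id = self.size, self.size + 1
        offset = self.size + 2
        self.modalities[name] = (start_id, end_id, offset, codebook_size)
        self.size = offset + codebook_size

    def encode_text(self, words):
        return [self.text_ids[w] for w in words]

    def encode_codes(self, name, codes):
        # `codes` stands in for the output of a modality tokenizer (assumed).
        start_id, end_id, offset, n = self.modalities[name]
        if any(c < 0 or c >= n for c in codes):
            raise ValueError(f"code out of range for modality {name!r}")
        return [start_id] + [offset + c for c in codes] + [end_id]


vocab = UnifiedVocab(["describe", "this", "image"])
vocab.add_modality("image", codebook_size=8)   # toy sizes, not the real ones
vocab.add_modality("speech", codebook_size=4)

# One flat sequence mixing text ids and image-code ids.
sequence = (
    vocab.encode_text(["describe", "this", "image"])
    + vocab.encode_codes("image", [5, 0, 7])
)
print(sequence)  # [0, 1, 2, 3, 10, 5, 12, 4]
```

In this sketch, supporting another modality is a single `add_modality` call plus a tokenizer for that modality; in a real model the embedding table would also need to grow to match the new vocabulary size.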

Target Users:

Engaging in multi-modal conversations

Supporting voice assistant and other applications

Creating multi-modal content

Total Visits: 423

Top Region: TH (100.00%)

Website Views: 98.0K

Features

Supporting input and output of multiple modalities including voice, text, images, and music

Conducting multi-turn conversations in which different modalities are interleaved

Achieving performance comparable to dedicated models across all modalities