# Multi-modal

MNN-LLM Android App
MNN-LLM is an efficient inference framework designed to optimize and accelerate the deployment of large language models on mobile devices and local PCs. It addresses high memory consumption and computational cost issues through model quantization, hybrid storage, and hardware-specific optimizations. MNN-LLM excels in CPU benchmark tests with significant speed improvements, making it ideal for users who need privacy protection and efficient inference.
Artificial intelligence
38.6K
Chinese Picks
Kimi-VL
Kimi-VL is an advanced mixture-of-experts (MoE) vision-language model designed for multi-modal reasoning, long-context understanding, and strong agent capabilities. The model excels across several complex domains while remaining efficient, activating only about 2.8B parameters, and shows outstanding mathematical reasoning and image understanding. Kimi-VL sets a new standard for multi-modal models with its optimized computational performance and ability to handle long inputs (an illustrative loading sketch follows this entry).
AI Model
41.7K
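Since Kimi-VL is distributed as an open checkpoint, a common way to try it locally is the standard `transformers` remote-code path. The sketch below is illustrative only: the repo id `moonshotai/Kimi-VL-A3B-Instruct`, the chat-template call, and the sample file `chart.png` are assumptions rather than details confirmed by this listing.

```python
# Hypothetical sketch: loading a Kimi-VL checkpoint with transformers.
# Repo id and processor interface are assumed; check the official model card.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("chart.png")  # any local test image
messages = [{"role": "user", "content": [
    {"type": "image", "image": "chart.png"},
    {"type": "text", "text": "Summarize this chart."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```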
EgoLife
EgoLife is an AI assistant project focused on long-term, multi-modal, multi-view daily life. The project generated approximately 50 hours of video data by recording the shared living experiences of six volunteers for a week, covering daily activities and social interactions. Its multi-modal data (including video, gaze, and IMU data) and multi-view camera system provide rich contextual information for AI research. Furthermore, the project introduces the EgoRAG framework for addressing long-term context understanding tasks, advancing AI's capabilities in complex environments.
Personal Care
50.0K
Migician
Migician is a multi-modal large language model developed by the Natural Language Processing Laboratory of Tsinghua University, focusing on multi-image grounding (localization) tasks. By introducing an innovative training framework and the large-scale MGrounding-630k dataset, the model significantly improves localization accuracy in multi-image scenarios, surpassing existing multi-modal large language models and even outperforming much larger 70B models. Migician's main strengths are its ability to handle complex multi-image tasks and to follow free-form localization instructions, giving it strong application prospects in multi-image understanding. The model is open-source on Hugging Face for researchers and developers to use.
AI Model
48.6K
Magma-8B
Magma-8B is a foundational multi-modal AI model developed by Microsoft, specifically designed for researching multi-modal AI agents. It integrates text and image inputs to generate text outputs and possesses visual planning and agent capabilities. The model utilizes Meta LLaMA-3 as its language model backbone and incorporates a CLIP-ConvNeXt-XXLarge vision encoder. It can learn spatiotemporal relationships from unlabeled video data, exhibiting strong generalization capabilities and multi-task adaptability. Magma-8B excels in multi-modal tasks, particularly in spatial understanding and reasoning. It provides a powerful tool for multi-modal AI research, advancing the study of complex interactions in virtual and real-world environments.
AI Model
58.0K
MILS
MILS is an open-source project released by Facebook Research, designed to demonstrate the capabilities of large language models (LLMs) in handling visual and auditory tasks without any prior training. This technology leverages pre-trained models and optimization algorithms to automatically generate descriptions for images, audio, and video. This breakthrough offers new insights into the development of multi-modal AI, showcasing the potential of LLMs in cross-modal tasks. The model is primarily targeted at researchers and developers, providing them with a powerful tool to explore multi-modal applications. Currently, this project is free and open-source, aimed at advancing academic research and technological development.
AI Model
47.5K
Janus-Pro-1B
Janus-Pro-1B is an innovative multi-modal model focused on unified multi-modal understanding and generation. By using separate visual encoding paths, it resolves the conflict between understanding and generation tasks found in traditional approaches while retaining a single, unified Transformer architecture. This design not only increases the model's flexibility but also delivers strong performance across multi-modal tasks, often surpassing models tailored to specific tasks. Built on the DeepSeek-LLM-1.5b-base/DeepSeek-LLM-7b-base architectures, the model uses SigLIP-L as its visual encoder, supports 384x384 image inputs, and employs a specialized image generation tokenizer. Its open-source nature and flexibility make it a strong candidate for next-generation multi-modal models.
AI Model
75.3K
Chinese Picks
Doubao-1.5-pro
Developed by the Doubao team, Doubao-1.5-pro is a high-performance sparse MoE (Mixture of Experts) large language model. The model achieves an excellent balance between model capability and inference efficiency through an integrated training-inference design. It excels on various public evaluation benchmarks, showing clear advantages in inference efficiency and multi-modal capabilities. It is suited to scenarios that require efficient inference and multi-modal interaction, such as natural language processing, image recognition, and speech interaction. Technically, it is built on a sparse-activation MoE architecture that optimizes the ratio of activated parameters and the training algorithms to achieve higher performance leverage than traditional dense models. It also supports dynamic parameter adjustment to match diverse application scenarios and cost requirements.
AI Model
452.9K
FlagAI
FlagAI, launched by the Beijing Academy of Artificial Intelligence, is a comprehensive and high-quality open-source project that integrates various mainstream large model algorithm technologies worldwide, as well as multiple large model parallel processing and training acceleration techniques. It supports efficient training and fine-tuning, aiming to lower the barriers to large model development and application while improving development efficiency. FlagAI covers several prominent models in various fields, such as language models like OPT and T5, vision models like ViT and Swin Transformer, and multi-modal models like CLIP. The Academy also continuously contributes the results of projects 'Wudao 2.0' and 'Wudao 3.0' to FlagAI. This project has been incorporated into the Linux Foundation, attracting global research talents for joint innovation and contribution.
Model Training and Deployment
47.7K
Stable Diffusion 3.5 Large Turbo
Stable Diffusion 3.5 Large Turbo is a multi-modal diffusion transformer (MMDiT) text-to-image model that uses Adversarial Diffusion Distillation (ADD) to improve image quality, layout, handling of complex prompts, and resource efficiency, with a particular focus on reducing the number of inference steps. The model excels at image generation and at understanding and rendering complex text prompts, making it suitable for a wide range of image generation scenarios. It is published on the Hugging Face platform under the Stability Community License, which permits free use for research and non-commercial purposes, as well as commercial use by organizations or individuals with annual revenue under $1 million (a minimal generation sketch follows this entry).
Image Generation
69.3K
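Because the Turbo variant is distilled for very few denoising steps, a minimal `diffusers` call is a natural illustration. `StableDiffusion3Pipeline` exists in recent diffusers releases; the repo id and the 4-step, zero-guidance settings follow the model card as commonly described and should be verified there before use.

```python
# Minimal sketch: few-step text-to-image with SD 3.5 Large Turbo via diffusers.
# Repo id and sampler settings are assumptions to verify against the model card.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo",  # assumed repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    "a watercolor lighthouse at dawn, soft light",
    num_inference_steps=4,  # the ADD distillation targets very few steps
    guidance_scale=0.0,     # distilled checkpoints are typically run without CFG
).images[0]
image.save("lighthouse.png")
```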
Stable Diffusion 3.5 Large
Stable Diffusion 3.5 Large is a multi-modal diffusion transformer (MMDiT) model developed by Stability AI for generating images from text. The model shows significant improvements in image quality, layout, handling of complex prompts, and resource efficiency. It employs three fixed, pretrained text encoders and improves training stability with QK normalization. Its training data and strategies incorporate synthetic data and filtered publicly available data. Under the community license, Stable Diffusion 3.5 Large is free for research and non-commercial use, and for commercial use by organizations or individuals with annual revenue under $1 million.
Image Generation
59.3K
Fresh Picks
Llama 3.2
Llama 3.2 is a family of pre-trained and fine-tuned large language models (LLMs): multilingual text-only models in 1B and 3B sizes, and models that take text and images as input and produce text output in 11B and 90B sizes. These models are designed for building high-performance, efficient applications. The Llama 3.2 models can run on mobile and edge devices, support multiple programming languages, and can be used to build agent applications through the Llama Stack (a minimal text-generation sketch follows this entry).
AI Model
53.0K
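As an illustration of the small text-only checkpoints, the sketch below uses the standard `transformers` text-generation pipeline. The repo id `meta-llama/Llama-3.2-3B-Instruct` is gated on Hugging Face, so an accepted license and access token are assumed.

```python
# Minimal sketch: chatting with a small Llama 3.2 instruct model via transformers.
# Assumes access to the gated meta-llama repo and a recent transformers version.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
messages = [{"role": "user", "content": "Explain edge deployment in one sentence."}]
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])  # assistant reply in chat format
```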
Data-Juicer
Data-Juicer is a comprehensive multimodal data processing system aimed at delivering higher quality, richer, and more digestible data for large language models (LLMs). It offers a systematic and reusable data processing library, supports collaborative development between data and models, allows rapid iteration through a sandbox lab, and provides features like data and model feedback loops, visualization, and multidimensional automated evaluation, helping users better understand and improve their data and models. Data-Juicer is actively maintained and regularly enhanced with more features, data recipes, and datasets.
AI Data Mining
61.8K
SEED-Story
SEED-Story is a multi-modal long-form story generation model built on a multi-modal large language model (MLLM). Given user-provided images and text, it can generate rich, coherent narrative text together with visually consistent images. It represents cutting-edge AI for creative writing and visual art, producing high-quality multi-modal story content and opening new possibilities for the creative industry.
AI Story Writing
59.3K
English Picks
Indexify
Indexify is an open-source data framework featuring a real-time extraction engine and pre-built extraction adapters, enabling reliable data extraction from various unstructured data sources like documents, presentations, videos, and audio. It supports multi-modal data, offers advanced embedding and chunking techniques, and allows users to create custom extractors using the Indexify SDK. Indexify empowers LLM applications to access the most accurate and up-to-date data by supporting semantic search and SQL queries for images, videos, and PDFs. Moreover, Indexify facilitates prototyping when running locally and utilizes pre-configured Kubernetes deployment templates in production environments for automatic scaling and handling of large data volumes.
Data Analysis
53.5K
TalkWithGemini
TalkWithGemini is a cross-platform application that supports free, one-click deployment. Users can interact with the Gemini model through this application, including image recognition and voice conversation, enhancing work efficiency.
AI Conversational Agents
53.0K
Video-MME
Video-MME is a benchmark for evaluating the performance of Multi-Modal Large Language Models (MLLMs) in video analysis. It fills the gap in existing evaluation methods regarding the ability of MLLMs to process continuous visual data, providing researchers with a high-quality and comprehensive evaluation platform. The benchmark covers videos of different lengths and evaluates core MLLM capabilities.
AI video analysis
66.8K
OpenCompass Multi-modal Leaderboard
The OpenCompass multi-modal leaderboard is a real-time platform for evaluating and ranking different multi-modal models (VLMs). It calculates the average score of models based on 8 multi-modal benchmarks and provides detailed performance data. The platform only includes open-source VLMs or publicly available APIs, aiming to help researchers and developers understand the latest advancements and performance of current multi-modal models.
AI information platform
119.8K
GPT4o (Omni)
GPT-4o (Omni) is a new model with multi-modal functionality across text, vision, and audio. It is especially notable for its voice capabilities, while also excelling at text, image, and audio processing. Its key advantage is the ability to process and generate multiple modalities simultaneously, with faster response times (an illustrative API call follows this entry).
AI Model
41.4K
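For text-plus-image use, GPT-4o is reachable through the official `openai` Python client; audio input/output goes through separate endpoints not shown here. The image URL is a placeholder, and an `OPENAI_API_KEY` in the environment is assumed.

```python
# Minimal sketch: one text + image request to GPT-4o with the openai client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
        ],
    }],
)
print(response.choices[0].message.content)
```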
Fresh Picks
Reka Core
Reka Core is a GPT-4 level multi-modal large language model (LLM) with powerful contextual understanding of images, videos, and audio. It is one of the only two commercially available comprehensive multi-modal solutions on the market. Core excels in multi-modal understanding, reasoning capabilities, coding and Agent workflows, multi-language support, and deployment flexibility.
AI Model
168.1K
Griffon
Griffon is the first high-resolution (over 1K) LVLM with localization capabilities, able to describe anything in a region of interest. In its latest version, Griffon supports visual-language grounding: you can provide an image or textual descriptions as input. Griffon excels at REC, object detection, object counting, visual/phrase localization, and REG. Pricing: free trial.
AI image detection and recognition
47.2K
AnyGPT
AnyGPT is a unified large language model that uses discrete representations to process multiple modalities uniformly, including speech, text, images, and music. AnyGPT can be trained stably without modifying the architecture or training paradigm of existing large language models; it relies entirely on data-level preprocessing, which lets new modalities be integrated into the language model as seamlessly as adding a new language. The authors constructed a text-centric multi-modal dataset for multi-modal alignment pre-training and, using generative models, built the first large-scale any-to-any multi-modal instruction dataset, consisting of 108,000 multi-turn dialogue examples with different modalities intertwined, enabling the model to handle arbitrary combinations of modal inputs and outputs. Experimental results show that AnyGPT can carry out any-to-any multi-modal dialogue while achieving performance comparable to dedicated models across all modalities, demonstrating that discrete representations can effectively and conveniently unify multiple modalities in a language model.
AI Model
98.0K
Multi-modal Large Language Models
This tool aims to assess the generalization ability, trustworthiness, and causal reasoning abilities of the latest proprietary and open-source MLLMs through qualitative studies across four modalities: text, code, images, and video, in order to increase the transparency of MLLMs. These attributes are taken as representative factors defining the reliability of MLLMs in support of various downstream applications. Specifically, the evaluation covers the closed-source GPT-4 and Gemini as well as 6 open-source LLMs and MLLMs, using 230 manually designed cases whose qualitative results are summarized into 12 scores (4 modalities times 3 attributes). In total, 14 empirical findings are revealed that help in understanding the capabilities and limitations of proprietary and open-source MLLMs, enabling more reliable support for multi-modal downstream applications.
AI Model Evaluation
46.4K
VCoder
VCoder is an adapter that improves the performance of multi-modal large language models on object-level visual tasks by using auxiliary perception modalities as control inputs. VCoder LLaVA is built on LLaVA-1.5; because VCoder does not fine-tune the LLaVA-1.5 parameters, its performance on general question-answering benchmarks is identical to LLaVA-1.5. VCoder has been benchmarked on the COST dataset and achieves good performance on semantic, instance, and panoptic segmentation tasks. The authors have also released the model's detection results and pre-trained models.
AI Model
56.3K
Google Gemini.co
Google Gemini is a multi-modal AI model developed by DeepMind that can process text, audio, and images. It comes in three versions: Ultra, Pro, and Nano, each tailored to different task complexities. Gemini has excelled in AI benchmark tests, is optimized for various devices, and has undergone safety and bias testing in line with responsible AI practices. It will be integrated into Google products and made available through Google AI Studio and Google Cloud Vertex AI (a minimal SDK sketch follows this entry).
AI Model
104.6K
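A quick way to exercise Gemini's multi-modal input is the `google-generativeai` Python SDK. The model name `gemini-1.5-pro`, the placeholder API key, and the local image path are assumptions; the listing itself only names the Ultra/Pro/Nano tiers.

```python
# Minimal sketch: multi-modal prompting with the google-generativeai SDK.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name

response = model.generate_content(
    ["Describe this picture in one sentence.", Image.open("photo.jpg")]
)
print(response.text)
```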
CLoT
CLoT is an innovative tool designed to explore the creative capabilities of large language models. It challenges users' thinking by generating humorous responses, helping them discover the potential of language models. CLoT is not limited to humor generation and can also be used for other creative tasks; more information is available on the official website.
AI Conversational Agents
102.7K
Fuyu-8B
Fuyu-8B is a multi-modal model trained by Adept AI that takes both text and images as input and generates text. It features a simplified architecture and training procedure, making it easy to understand, extend, and deploy. Designed for digital agents, it supports arbitrary image resolutions, can answer questions about charts, graphs, and UIs, and can perform fine-grained localization on screen images. It responds quickly, processing large images within 100 milliseconds. While optimized for Adept's own use cases, it performs well on standard image understanding benchmarks such as visual question answering and natural image captioning. The released checkpoint is a base model, and Adept encourages fine-tuning it for specific use cases such as long captions or multi-modal chat; in their experience the model adapts well to few-shot learning and fine-tuning across a variety of use cases (a minimal transformers sketch follows this entry).
AI Model
118.7K
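Fuyu-8B has first-class support in `transformers` via `FuyuProcessor` and `FuyuForCausalLM`. The sketch below shows a single UI-style question; the prompt phrasing and the screenshot path are illustrative assumptions rather than details from this listing.

```python
# Minimal sketch: asking Fuyu-8B a question about a screenshot via transformers.
from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="auto")

image = Image.open("ui_screenshot.png")  # any local screenshot
prompt = "Answer the following question about the image: which button is highlighted?\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=32)
new_tokens = generated[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```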
Kosmos-2
Kosmos-2 is a multi-modal large language model that grounds natural language in visual inputs such as images. It can be used for tasks such as phrase grounding, referring expression comprehension and generation, image captioning, and visual question answering. Kosmos-2 is trained and evaluated with the GRIT dataset, which contains a large number of grounded image-text pairs. Its strength lies in linking natural language with visual information, which improves performance on these multi-modal tasks (a minimal transformers sketch follows this entry).
AI Model
54.9K
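Kosmos-2 is available in `transformers` as `Kosmos2ForConditionalGeneration`, and its processor can split the generated text into a caption plus grounded entities. The `<grounding>` prompt prefix and the sample image path follow the Hugging Face model card as commonly documented; treat them as assumptions to verify.

```python
# Minimal sketch: grounded captioning with Kosmos-2 via transformers.
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration
from PIL import Image

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = Kosmos2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open("snowman.jpg")   # any local test image
prompt = "<grounding>An image of"   # grounding prefix from the model card
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(raw_text)  # caption + boxes
print(caption)
print(entities)
```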
MagicAvatar
MagicAvatar is a multi-modal framework that can convert various input modes (text, video, and audio) into motion signals, thereby generating/animating avatars. It can create avatars through simple text prompts and also create avatars that follow given movements based on provided source videos. It can also animate avatars with specific themes. MagicAvatar's strength lies in its ability to combine multiple input modes to generate high-quality avatars and animations.
AI head portrait generation
62.1K
## Featured AI Tools
Flow AI
Flow is an AI-driven movie-making tool designed for creators, utilizing Google DeepMind's advanced models to allow users to easily create excellent movie clips, scenes, and stories. The tool provides a seamless creative experience, supporting user-defined assets or generating content within Flow. In terms of pricing, the Google AI Pro and Google AI Ultra plans offer different functionalities suitable for various user needs.
Video Production
42.2K
NoCode
NoCode is a platform that requires no programming experience, allowing users to quickly generate applications by describing their ideas in natural language, aiming to lower development barriers so more people can realize their ideas. The platform provides real-time previews and one-click deployment features, making it very suitable for non-technical users to turn their ideas into reality.
Development Platform
44.7K
ListenHub
ListenHub is a lightweight AI podcast generation tool that supports both Chinese and English. Based on cutting-edge AI technology, it can quickly generate podcast content of interest to users. Its main advantages include natural dialogue and ultra-realistic voice effects, allowing users to enjoy high-quality auditory experiences anytime and anywhere. ListenHub not only improves the speed of content generation but also offers compatibility with mobile devices, making it convenient for users to use in different settings. The product is positioned as an efficient information acquisition tool, suitable for the needs of a wide range of listeners.
AI
42.0K
MiniMax Agent
MiniMax Agent is an intelligent AI companion built on the latest multi-modal technology. Through MCP-based multi-agent collaboration, its AI team can efficiently tackle complex problems. It offers features such as instant answers, visual analysis, and voice interaction, increasing productivity by up to 10 times.
Multimodal technology
43.1K
Chinese Picks
Tencent Hunyuan Image 2.0
Tencent Hunyuan Image 2.0 is Tencent's latest AI image generation model, with significant improvements in generation speed and image quality. Thanks to an ultra-high-compression-ratio codec and a new diffusion architecture, images are generated at millisecond-level speeds, removing the waiting time of traditional generation. The model also combines reinforcement learning with human aesthetic knowledge to improve the realism and detail of generated images, making it suitable for professional users such as designers and creators.
Image Generation
41.7K
OpenMemory MCP
OpenMemory is an open-source personal memory layer that provides private, portable memory management for large language models (LLMs). It ensures users have full control over their data, maintaining its security when building AI applications. This project supports Docker, Python, and Node.js, making it suitable for developers seeking personalized AI experiences. OpenMemory is particularly suited for users who wish to use AI without revealing personal information.
open source
42.2K
FastVLM
FastVLM is an efficient visual encoding model designed specifically for visual language models. It uses the innovative FastViTHD hybrid visual encoder to reduce the time required for encoding high-resolution images and the number of output tokens, resulting in excellent performance in both speed and accuracy. FastVLM is primarily positioned to provide developers with powerful visual language processing capabilities, applicable to various scenarios, particularly performing excellently on mobile devices that require rapid response.
Image Processing
41.4K
Chinese Picks
LiblibAI
LiblibAI is a leading Chinese AI creative platform offering powerful AI creative tools to help creators bring their imagination to life. The platform provides a vast library of free AI creative models, allowing users to search and utilize these models for image, text, and audio creations. Users can also train their own AI models on the platform. Focused on the diverse needs of creators, LiblibAI is committed to creating inclusive conditions and serving the creative industry, ensuring that everyone can enjoy the joy of creation.
AI Model
6.9M