

Step-R1-V-Mini
Overview
Step-R1-V-Mini is a new multimodal reasoning model from Jieyue Xingchen. It accepts image and text input, produces text output, and shows strong instruction following and general capability. The model is optimized for reasoning in multimodal scenarios: it combines multimodal joint reinforcement learning with training that makes extensive use of multimodal synthetic data, markedly improving its handling of complex reasoning chains in image space. Step-R1-V-Mini has performed strongly on several public leaderboards, ranking first domestically on the MathVision visual-reasoning leaderboard and demonstrating excellent visual reasoning, mathematical logic, and coding ability. The model is live on the Jieyue AI web page, and API access is available on the Jieyue Xingchen open platform for developers and researchers.
Target Users
The product suits developers, researchers, and enterprises that need multimodal reasoning, for tasks such as image recognition, location inference, and recipe generation. It helps them process complex multimodal data efficiently and accurately, supporting innovation in these fields.
Use Cases
Given a user-submitted photo of Wembley Stadium, Step-R1-V-Mini quickly identifies visual cues in the image to reason about location, correctly inferring Wembley Stadium and even suggesting which teams might be playing.
Given a photo of a dish, Step-R1-V-Mini identifies the ingredients and condiments and lists specific quantities, such as "300 g of fresh shrimp, 2 scallions".
Given a picture containing objects of different shapes, colors, and positions, Step-R1-V-Mini identifies each object and reasons over their colors, shapes, and positions to determine, for example, how many objects remain.
Features
Supports image and text input with text output, handling high-precision image perception and complex reasoning tasks.
Employs multimodal joint reinforcement learning based on the PPO strategy, introducing verifiable rewards in the image space to handle complex reasoning chains and reduce easily confused correlation-versus-causation errors.
Makes extensive use of multimodal synthetic data, with numerous data-synthesis pipelines grounded in environmental feedback; PPO-based reinforcement learning training then improves text and visual reasoning simultaneously.
Performs strongly on several public leaderboards, ranking first domestically on the MathVision visual-reasoning leaderboard, with excellent results in visual reasoning, mathematical logic, and code.
Officially available on the Jieyue AI web page, with API interfaces on the Jieyue Xingchen open platform for developers and researchers.
Possesses good instruction following and general capabilities, adapting to a variety of multimodal reasoning scenarios.
Provides accurate outputs such as locations, recipes, and object counts through precise image recognition and reasoning.
Under continuous exploration and optimization, expanding what is possible in multimodal reasoning.
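The verifiable-reward idea behind the training method above can be sketched minimally. The function below is an illustrative stand-in, not the model's actual training code: a rule-checkable reward (here, exact match after normalization) of the kind a PPO trainer could consume in place of a learned reward model.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the
    verifiable ground truth after light normalization, else 0.0.
    Rewards like this are checkable by a program, which is what makes
    them usable as a PPO training signal without a reward model."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

print(verifiable_reward("Wembley Stadium.", "wembley stadium"))  # 1.0
```

In practice a reward like this would be computed per rollout and fed to the PPO advantage estimate; tasks with checkable answers (counts, locations, math results) are the natural fit.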
How to Use
Visit the Jieyue AI web page or the Jieyue Xingchen open platform.
Register and log in to obtain API access.
Choose the appropriate API interface for your needs and call it as described in the documentation.
Send the image and text to be reasoned over as input to the API.
Receive the inference result returned by the API and use it in your downstream processing.
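The steps above can be sketched as a request builder. This is a hypothetical example assuming an OpenAI-style chat-completions schema; the actual endpoint URL, model identifier ("step-r1-v-mini" here is a guess), and message format must be confirmed against the Jieyue Xingchen open-platform documentation.

```python
import base64

def build_inference_request(image_path: str, question: str,
                            model: str = "step-r1-v-mini") -> dict:
    """Build a chat-completions-style payload pairing a base64-encoded
    image with a text question. Model name and message schema are
    assumptions; check the open-platform docs before use."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }
```

Under this schema you would POST the payload with your API key in an `Authorization: Bearer` header and read the model's answer from the returned message content.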