Phi-3-mini-4k-instruct-onnx
Overview
Phi-3 Mini is a lightweight, state-of-the-art open model built on the same kind of data used for Phi-2 (synthetic data and filtered website data), with a focus on high-quality, reasoning-dense content. The model went through a post-training process that combines supervised fine-tuning and direct preference optimization to ensure precise instruction following and robust safety measures. This repository provides an optimized ONNX version of Phi-3 Mini that can be accelerated on CPUs and GPUs with ONNX Runtime across servers, Windows, Linux, and Mac, with the best precision configuration for each target. ONNX Runtime's DirectML support additionally lets developers bring hardware acceleration at scale to Windows devices with AMD, Intel, and NVIDIA GPUs.
Target Users
["- Business: Integration of Phi-3 Mini into various business applications to provide natural language processing capabilities","- Developers: Leverage the powerful generation capabilities of Phi-3 Mini to develop various language-related applications and services, such as conversational systems, Q&A systems, text generation, and data analysis","- Individual Users: Utilize Phi-3 Mini to produce high-quality natural language content to assist with writing, inquiries, and other needs"]
Total Visits: 29.7M
Top Region: US (17.94%)
Website Views: 59.6K
Use Cases
1. Integrate Phi-3 Mini into a business's intelligent assistant system to provide customers with natural-language interaction and generation services
2. Build automatic text generation and creative-assistance tools on top of Phi-3 Mini to support writers, content creators, and others
3. Use Phi-3 Mini's reasoning capabilities to build data analysis and report generation systems that produce analysis reports automatically
Features
- Supports accelerated inference on multiple hardware targets:
  - DirectML: for Windows devices with AMD, Intel, and NVIDIA GPUs, in int4 precision via AWQ quantization
  - FP16 CUDA: for NVIDIA GPUs, in FP16 precision
  - Int4 CUDA: for NVIDIA GPUs, in int4 precision via RTN quantization
  - Int4 CPU and mobile: int4 precision via RTN quantization, provided in two versions that trade off latency and accuracy for CPU and mobile devices
- Offers the new Generate() API for ONNX Runtime, which greatly simplifies integrating generative AI models into applications (see the sketch after this list)
- Strong performance: up to 10 times faster than PyTorch and up to 3 times faster than Llama.cpp
- Supports inference with large batch sizes, long prompts, and long outputs
- Small model size after quantization, which makes deployment easier
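As a rough illustration of how little code the Generate() API requires, here is a minimal token-streaming sketch assuming the Python onnxruntime-genai package and a locally downloaded model folder. The folder path and prompt are placeholders, and the exact call names follow the early examples that shipped alongside this model; the API has shifted slightly between releases, so check the current onnxruntime-genai documentation.

```python
import onnxruntime_genai as og

# Placeholder path: point this at whichever downloaded variant matches your hardware.
model = og.Model("./directml/directml-int4-awq-block-128")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3 chat template around a single user turn.
prompt = "<|user|>\nExplain in one sentence what ONNX Runtime does. <|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

# Generate and stream tokens as they are produced.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(token), end="", flush=True)
print()
```

Streaming the tokens through the tokenizer's stream decoder is what makes the long-output case practical: text appears as it is generated rather than after the full sequence completes.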
How to Use
1. Download the required ONNX model files from the model's Hugging Face page
2. Install ONNX Runtime and the ONNX Runtime Generate() API packages
3. Load the ONNX model files in your code
4. Use the ONNX Runtime Generate() API to set inference parameters such as batch size and prompt/output length
5. Call the generation function with your text prompt
6. Retrieve the output and perform any subsequent processing (these steps are combined in the sketch below)
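Steps 1 and 2 typically amount to a model download plus a pip install, for example with `huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx` and `pip install onnxruntime-genai` (or the `onnxruntime-genai-cuda` / `onnxruntime-genai-directml` builds for GPU targets). Steps 3 through 6 are sketched below for a small batch of prompts. This follows the style of the package's early batch examples; the folder path, prompts, and the `model.generate` helper are illustrative assumptions to verify against the current onnxruntime-genai documentation.

```python
import onnxruntime_genai as og

# Step 3: load the downloaded model folder (placeholder path for the int4 CPU variant).
model = og.Model("./cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)

# Phi-3 chat template applied to each prompt in a small batch.
template = "<|user|>\n{msg} <|end|>\n<|assistant|>"
prompts = [
    template.format(msg="Summarize the benefits of ONNX Runtime in two sentences."),
    template.format(msg="Write a haiku about quantization."),
]

# Step 4: set inference parameters; the batch size is implied by the number of
# encoded prompts, and max_length bounds prompt plus output tokens.
params = og.GeneratorParams(model)
params.set_search_options(max_length=512, do_sample=True, temperature=0.7)
params.input_ids = tokenizer.encode_batch(prompts)

# Step 5: run generation for the whole batch.
output_tokens = model.generate(params)

# Step 6: decode each sequence and post-process as needed.
for i in range(len(prompts)):
    print(f"--- response {i} ---")
    print(tokenizer.decode(output_tokens[i]))
```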