Speech recognition

# Speech recognition

Moonshine Web

Moonshine Web is a simple application built with React and Vite, running on Moonshine Base, a powerful speech recognition model optimized for fast and accurate automatic speech recognition (ASR), particularly suited for resource-constrained devices. The application runs locally in the browser, utilizing Transformers.js and WebGPU for acceleration (with WASM as an alternative). Its significance lies in providing users with a serverless solution for local speech recognition, which is especially crucial for scenarios requiring swift processing of voice data.

Speech Recognition

Aixploria

Aixploria is a website focused on artificial intelligence, offering an online directory of AI tools that helps users find and select the best AI solutions to meet their needs. With a simplified design and intuitive search engine, users can easily search for various AI applications using keywords. Aixploria not only provides a list of tools but also publishes articles explaining how each AI works, helping users understand the latest trends and popular applications. Additionally, Aixploria features a 'Top 10 AI' section that is updated in real-time, allowing users to quickly learn about the top AI tools in each category. Aixploria is suitable for anyone interested in AI, whether beginners or experts, and valuable information can be found here.

AI information platform

SenseVoiceSmall

Sensevoicesmall

SenseVoiceSmall is a speech foundation model that supports multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language recognition (LID), speech emotion recognition (SER), and audio event detection (AED). After training for more than 400,000 hours on data, the model supports more than 50 languages and has a recognition performance that surpasses the Whisper model. The SenseVoiceSmall model, which is a small model, uses a non-autoregressive end-to-end framework with extremely low inference latency and handles a 10-second audio in only 70 milliseconds, which is 15 times faster than Whisper-Large. In addition, SenseVoice also provides convenient fine-tuning scripts and strategies, supports multi-concurrency request service deployment pipelines, and the client languages include Python, C++, HTML, Java, and C#.

AI speech recognition

StreamSpeech

StreamSpeech is a real-time speech-to-speech translation model based on multi-task learning. By learning translation and synchronization strategies in a unified framework, it effectively identifies the translation timing within streaming voice input, achieving a high-quality real-time communication experience. The model has demonstrated leading performance on the CVSS benchmark and can provide low-latency intermediate results, such as ASR or translation.

Any GPT

AnyGPT is a unified large-scale language model that employs discrete representations for the uniform processing of various modalities, including voice, text, images, and music. AnyGPT can be trained stably without modifying the architecture or training paradigm of existing large-scale language models. It relies entirely on data-level preprocessing, which facilitates the seamless integration of new modalities into the language model, akin to the addition of a new language. We have constructed a text-centric multi-modal dataset for multi-modal alignment pre-training. Utilizing generative models, we have created the first large-scale multi-modal instruction dataset from any modality to any modality. It consists of 108,000 multi-turn dialogue examples with different modalities intertwined, enabling the model to handle combinations of any modal input and output. Experimental results indicate that AnyGPT can facilitate multi-modal dialogues from any modality to any modality and achieve performance comparable to dedicated models across all modalities, demonstrating that discrete representations can be effectively and conveniently used for unifying multiple modalities in language models.

Speechnotes

Speechnotes is a reliable and secure web-based speech-to-text tool that can quickly and accurately transcribe audio and video recordings, as well as allow for dictation notes instead of typing, saving you time and effort. Speechnotes features voice commands for punctuation and formatting, automatic capitalization, and easy import and export options, providing you with an efficient and user-friendly dictation and transcription experience. Speechnotes has been serving millions of users since 2015.

Transcribe

Transcribe ~ Speech to Text is an iOS speech-to-text application. It leverages OpenAI's Whisper technology and Apple's Neural Engine to achieve high-precision speech recognition, directly transcribing audio and video files into readable text. It supports both offline and cloud-based recognition modes. Applicable to various speech-to-text needs, it is simple and easy to use.

AI speech-to-text

Hanami Live Translator

Hanami Live Translator

Hanami Live Translator is a real-time translation tool that captures any audio from WINDOWS speakers and microphones. It utilizes lightweight multi-process and chunk processing of audio, with each chunk taking approximately 3-5 seconds to process. The application creates a hardware loopback via low-level access, allowing it to listen to content even when the speakers are muted. It uses the soundcard library to capture audio signals, the SpeechRecognition library to convert binary audio to text, and the selenium library to simulate network calls to deepl servers for free translation. The application requires an internet connection to operate and logs all actions through the Traces.log file.

SpeechEvalPro API

Speechevalpro API

The voice evaluation API is based on independently developed education AI voice models, integrating core technologies such as voice evaluation and speech recognition. It provides high-quality, multi-dimensional Chinese and English pronunciation evaluation APIs to help customers create intelligent learning products and realize human-machine interaction. Product features include core patented technology, stable and reliable AI models, and rich evaluation dimensions, including completeness, accuracy, and fluency. Pricing strategies include free trials, professional versions, and enterprise versions. Supports various evaluation scenarios, such as homework and exams. Supports HTTP and WebSocket protocols.

Easy Save AI

Translate.video is an AI-powered video translation tool that can help users automatically translate the audio and subtitles of videos into multiple languages. Utilizing advanced speech recognition and machine translation technologies, the tool ensures efficient and accurate video content translation. Users simply need to upload a video or input a video link, select the target language, and they can quickly obtain the translated video. Translate.video also supports automatic subtitle generation and editing, making it convenient for users to make fine adjustments and proofread the subtitles. The tool offers flexible pricing plans, including various packages and payment options, to cater to diverse user needs.

TTSLabs

TTSLabs is an online voice synthesis and speech recognition service, offering high-quality, natural and fluent voice synthesis and accurate and reliable speech recognition. Through simple API calls, users can convert text to real speech and convert speech to text. TTSLabs supports multiple voice styles and multiple languages, featuring fast response and high efficiency. Pricing is flexible and transparent, suitable for both individual developers and enterprise users.

Featured AI Tools

Jules AI

Jules は、自動で煩雑なコーディングタスクを処理し、あなたに核心的なコーディングに時間をかけることを可能にする異步コーディングエージェントです。その主な強みは GitHub との統合で、Pull Request(PR) を自動化し、テストを実行し、クラウド仮想マシン上でコードを検証することで、開発効率を大幅に向上させています。Jules はさまざまな開発者に適しており、特に忙しいチームには効果的にプロジェクトとコードの品質を管理する支援を行います。

開発プログラミング

NoCode

NoCode はプログラミング経験を必要としないプラットフォームで、ユーザーが自然言語でアイデアを表現し、迅速にアプリケーションを生成することが可能です。これにより、開発の障壁を下げ、より多くの人が自身のアイデアを実現できるようになります。このプラットフォームはリアルタイムプレビュー機能とワンクリックデプロイ機能を提供しており、技術的な知識がないユーザーにも非常に使いやすい設計となっています。

開発プラットフォーム

ListenHub

ListenHub は軽量級の AI ポッドキャストジェネレーターであり、中国語と英語に対応しています。最先端の AI 技術を使用し、ユーザーが興味を持つポッドキャストコンテンツを迅速に生成できます。その主な利点には、自然な会話と超高品質な音声効果が含まれており、いつでもどこでも高品質な聴覚体験を楽しむことができます。ListenHub はコンテンツ生成速度を改善するだけでなく、モバイルデバイスにも対応しており、さまざまな場面で使いやすいです。情報取得の高効率なツールとして位置づけられており、幅広いリスナーのニーズに応えています。

腾讯混元画像 2.0

腾讯混元画像 2.0

腾讯混元画像 2.0 は腾讯が最新に発表したAI画像生成モデルで、生成スピードと画質が大幅に向上しました。超高圧縮倍率のエンコード?デコーダーと新しい拡散アーキテクチャを採用しており、画像生成速度はミリ秒級まで到達し、従来の時間のかかる生成を回避することが可能です。また、強化学習アルゴリズムと人間の美的知識の統合により、画像のリアリズムと詳細表現力を向上させ、デザイナー、クリエーターなどの専門ユーザーに適しています。

OpenMemory MCP

OpenMemoryはオープンソースの個人向けメモリレイヤーで、大規模言語モデル（LLM）に私密でポータブルなメモリ管理を提供します。ユーザーはデータに対する完全な制御権を持ち、AIアプリケーションを作成する際も安全性を保つことができます。このプロジェクトはDocker、Python、Node.jsをサポートしており、開発者が個別化されたAI体験を行うのに適しています。また、個人情報を漏らすことなくAIを利用したいユーザーにお勧めします。

オープンソース

FastVLM

FastVLM は、視覚言語モデル向けに設計された効果的な視覚符号化モデルです。イノベーティブな FastViTHD ミックスドビジュアル符号化エンジンを使用することで、高解像度画像の符号化時間と出力されるトークンの数を削減し、モデルのスループットと精度を向上させました。FastVLM の主な位置付けは、開発者が強力な視覚言語処理機能を得られるように支援し、特に迅速なレスポンスが必要なモバイルデバイス上で優れたパフォーマンスを発揮します。

ピカは、ユーザーが自身の創造的なアイデアをアップロードすると、AIがそれに基づいた動画を自動生成する動画制作プラットフォームです。主な機能は、多様なアイデアからの動画生成、プロフェッショナルな動画効果、シンプルで使いやすい操作性です。無料トライアル方式を採用しており、クリエイターや動画愛好家をターゲットとしています。

LiblibAI

LiblibAIは、中国をリードするAI創作プラットフォームです。強力なAI創作能力を提供し、クリエイターの創造性を支援します。プラットフォームは膨大な数の無料AI創作モデルを提供しており、ユーザーは検索してモデルを使用し、画像、テキスト、音声などの創作を行うことができます。また、ユーザーによる独自のAIモデルのトレーニングもサポートしています。幅広いクリエイターユーザーを対象としたプラットフォームとして、創作の機会を平等に提供し、クリエイティブ産業に貢献することで、誰もが創作の喜びを享受できるようにすることを目指しています。

AIbase

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

© 2025AIbase