OpenAI's Whisper is the most accurate AI speech recognition tool we've tried so far

There are a few ways to transcribe an interview or a video. You could do it by hand just by listening, which will give you the best accuracy but takes by far the longest, or you could use a service or tool. For example, I used to use YouTube, let it automatically generate subtitles, save those subtitles, and edit them to fix all the problems. Now, there are various AI tools that can do an excellent job, and one such tool is OpenAI's Whisper.

To demonstrate just how well the tool works, I transcribed the most recent XDA TV video. As you can see below, it will transcribe and timestamp sections, which can easily be used as subtitles on platforms like YouTube. It works quickly, too; I used it on my M1 MacBook Pro to transcribe a 10-minute video in just over five and a half minutes.

OpenAI Whisper transcribing a video from the XDA-Developers YouTube channel

This tool is a game-changer for content creators who need to generate subtitles, people who need to transcribe interviews, or anyone who just wants to turn any kind of audio into text. I've found its accuracy incredible; I recently transcribed a 25-minute interview in which not a single thing was transcribed incorrectly. Whisper can also translate speech from other languages into English as it transcribes.
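To give a rough idea of how those timestamped sections become subtitles, here is a minimal Python sketch that turns a list of segments into SRT text, the subtitle format YouTube accepts. It assumes each segment is a dictionary with "start", "end", and "text" keys (which is how Whisper's Python package reports segments); the function names and the sample data are made up for illustration and are not real Whisper output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm string SRT expects."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn {"start", "end", "text"} segments into numbered SRT blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates one subtitle block from the next.
    return "\n".join(blocks)


# Hypothetical sample segments, not real Whisper output.
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back to the channel."},
    {"start": 4.2, "end": 9.75, "text": " Today we are looking at a new phone."},
]
print(segments_to_srt(sample_segments))
```

Saving that string to a .srt file should give you something you can upload alongside a video and then tidy up by hand if needed.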

What is Whisper?

Whisper is an automatic speech recognition system that demonstrates incredible accuracy in understanding spoken words. It was built by OpenAI, presumably for use in systems like ChatGPT, where you can now converse with an AI, but the company also open-sourced Whisper so that the community could use it as well.

How OpenAI's Whisper works and was trained

How it works is fairly advanced. Whisper was trained on 680,000 hours of supervised data collected from the internet, about a third of which was not in English. Audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder; a trained decoder then tries to predict the corresponding text caption. Other steps take place here, too, but they're pretty technical and involve identifying the language being spoken, multilingual speech transcription, and translation to English.
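The 30-second chunking is the easiest part of that pipeline to picture in code. The Python sketch below is a simplification under a couple of assumptions: it treats audio as a plain list of mono samples at 16,000 samples per second (the rate Whisper resamples to, which the description above doesn't mention), and it cuts fixed windows, padding the last one with silence. The real implementation is more involved and moves its window based on the timestamps it predicts, so treat this as an illustration of the idea rather than Whisper's actual code. The function name is hypothetical.

```python
SAMPLE_RATE = 16_000                          # samples per second (assumption noted above)
CHUNK_SECONDS = 30                            # window length the model works on
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window


def split_into_chunks(samples: list[float]) -> list[list[float]]:
    """Split mono samples into 30-second windows, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            # Pad the final, shorter window with silence so every chunk matches.
            chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 seconds of silence stands in for real audio here.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), "chunks of", len(chunks[0]), "samples each")
```

Each of those fixed-size windows is what would then be converted and handed to the encoder.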

How does Whisper compare to other tools?

As for how it compares to other tools, OpenAI says that Whisper makes up to 50% fewer errors than other speech recognition models, and I believe it. I have used a lot of tools over the years to try and transcribe audio, and nothing has been as accurate as Whisper for me. As I mentioned, I transcribed a 25-minute interview that came out flawlessly, something pretty much every other tool struggles with.

One particularly interesting thing about Whisper is that it's not a tool aimed at end users but rather at developers and researchers. OpenAI said the reason for open-sourcing the models and code was to "serve as a foundation for building useful applications and for further research on robust speech processing." You can still set it up and use it, but it's not really a consumer product yet.

There are multiple models that you can use when transcribing audio, each with different VRAM requirements. The largest model requires about 10GB of VRAM, though it's also the most accurate. Every size except the largest also comes in an English-only version, which tends to perform better if you know the content that you're transcribing is only in English. Either way, you'll want a good GPU with enough VRAM to get it up and running.
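If you're not sure which model your GPU can handle, the choice can be sketched as a simple lookup. The Python snippet below is only an illustration: the 10GB figure for the largest model comes from above, while the other VRAM numbers are approximate values as I recall them from the Whisper repository's README and should be treated as assumptions that may change between releases. The pick_model helper is hypothetical and not part of Whisper itself.

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
# Only the 10 GB figure is stated in the article; the rest are assumptions.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_vram_gb: float, english_only: bool = False) -> str:
    """Return the largest model name that fits in the given amount of VRAM."""
    chosen = None
    for name, needed_gb in MODEL_VRAM_GB:
        if needed_gb <= available_vram_gb:
            chosen = name  # keep overwriting so the largest fitting model wins
    if chosen is None:
        raise ValueError("Not enough VRAM for even the smallest model")
    # English-only variants exist for every size except the largest.
    if english_only and chosen != "large":
        chosen += ".en"
    return chosen


print(pick_model(8))                       # a GPU with 8 GB of VRAM
print(pick_model(8, english_only=True))    # same GPU, English-only audio
```

In practice, you may want to step down a size if other applications are also using your GPU's memory.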

How to use OpenAI's Whisper

Whisper from OpenAI is an open-source tool that you can run locally pretty easily by following a few tutorials. If you have a MacBook, there are some more convoluted steps to get it working, but it's not too bad, as you'll basically just need to compile a C++ version of Whisper from source yourself. It's not an official port, but it's the only way to get it to run natively on Apple silicon. You can follow this tutorial on Medium for how to do that.

You can also just run it in Google Colab, though it's slower, or you can run it locally if you have an x86 machine. You just need to make sure you have ffmpeg installed, and then you can clone the Git repository that Whisper is in and run it. Simply follow the instructions in the Whisper Git repository, and you'll be able to set up Whisper in no time. The more powerful your hardware is, the better, of course, but it will run on basically anything with enough VRAM; it will just take longer if your PC is slower.
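Once Whisper is installed, running it mostly comes down to one command. The Python sketch below builds that command and checks that ffmpeg is on your PATH before you try it. It assumes the whisper command-line tool from the official repository is installed and uses its --model, --task, and --output_format options; the helper names and the interview.mp3 filename are hypothetical, and the script only prints the command rather than running it.

```python
import shlex
import shutil


def build_whisper_command(audio_path: str, model: str = "medium",
                          task: str = "transcribe",
                          output_format: str = "srt") -> list[str]:
    """Build the argument list for the whisper command-line tool."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    return [
        "whisper", audio_path,
        "--model", model,
        "--task", task,                    # "translate" produces English text
        "--output_format", output_format,  # e.g. srt for subtitles
    ]


def ffmpeg_available() -> bool:
    """Check whether an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


# Hypothetical input file; swap in your own recording.
command = build_whisper_command("interview.mp3", model="small")
print(shlex.join(command))
if not ffmpeg_available():
    print("ffmpeg was not found on PATH; install it before running the command.")
```

To actually run the command from Python, you could pass the list to subprocess.run(command, check=True), or just copy the printed line into a terminal.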