Comparison of Transcription Services for Swedish

Many services offer automatic transcription. We have tested several of them to see how well they actually handle Swedish.

We have tested the following services:

Microsoft Azure Speech-to-Text
Google Cloud Speech-to-Text
OpenAI Whisper
Klang.ai

Test Setup

Models are usually evaluated on relatively short audio files, often just a few seconds long. In practice, though, you often need to transcribe recordings that run for an hour or more.

We have selected over 10 hours of audio material in interview format, where each interview lasts up to an hour. The interviews have different speakers and vary in topic. We then compare the results from the different services with transcriptions made by professional transcribers.

Results

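To measure how similar two texts are, we use Word Error Rate (WER), a common measure for evaluating transcription services. WER counts the words that must be substituted, deleted, or inserted to turn the automatic transcription into the reference transcription, divided by the number of words in the reference. A lower WER is better.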
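As a rough illustration of the measure, here is a minimal Python sketch of a WER calculation. It is not the evaluation code used for this comparison. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for the example; a real evaluation would typically also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of five reference words is missing from the transcription.
print(word_error_rate("jag bor i stockholm nu", "jag bor stockholm nu"))  # 0.2
```

Because insertions count as errors too, a transcription that adds text that was never spoken is penalized in the same way as one that drops words.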

Klang.ai clearly performs best, with an error rate of 7.8%. The next best alternative, from Microsoft, had an error rate of 12.4%, which means Klang.ai makes roughly 37% fewer errors!

Why Klang.ai is Better

At Klang.ai, we have focused heavily on Swedish specifically. This has allowed us to train the model on both more and better Swedish data than the other models. Swedish is, after all, a relatively small language, so it is a lower priority for the large American companies.

This focus, together with our expertise in AI and machine learning, is what enables us to offer the world's best model for Swedish!

Why is OpenAI's Whisper Not Good Enough?

There is a lot of talk about OpenAI's Whisper model, which has been released freely to the public. The coolest thing about Whisper is that it can transcribe around 100 different languages, most of them with relatively good quality. To support so many languages, OpenAI collected data from many different sources of varying quality.

The disadvantage is that OpenAI is not able to review the results in each language and filter poor transcriptions out of the training data.

This leads to several different problems with OpenAI's Whisper:

Hallucinations in silent parts - The model writes out entire sentences that are not in the audio, often in silent parts. This is probably because the model has been trained on many videos with subtitles, likely from YouTube or other video sites. Such sites often host instructional videos with descriptive subtitles and no speech at all, which has taught the model to write out text when it can't hear anyone talking.
Hallucinations at the end - For the same reason as above, the model often writes out "Donate to me on my Patreon", "Subtitled by some-subtitling-service.com" or similar at the end of the transcription. Lines like these often appear at the end of subtitle files without ever being said in the video.
Loses entire sentences - The model is actually designed to handle audio snippets of up to 30 seconds. To handle longer files, the model constantly guesses how far into each 30-second block it has transcribed, and regularly jumps forward in the audio to a new 30-second block. But it's not uncommon for the model to guess wrong and jump forward too far, which causes it to lose entire sentences, sometimes up to 30 seconds at a time.
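
The last problem can be illustrated with a small toy simulation in Python. This is not Whisper's actual code, only a sketch of the windowing behaviour described above. The function `skipped_spans` and its inputs are hypothetical: each step pairs how far into the current window the text really reaches with how far the model guesses it reaches, both in seconds.

```python
WINDOW_SECONDS = 30.0  # the block length described above


def skipped_spans(total_seconds, steps):
    """Return the (start, end) time spans that are never transcribed.

    Each step is a hypothetical pair (transcribed_until, predicted_until),
    both measured in seconds from the start of the current window.
    """
    seek = 0.0  # where the current window starts in the full recording
    skipped = []
    for transcribed_until, predicted_until in steps:
        if seek >= total_seconds:
            break
        window_end = min(seek + WINDOW_SECONDS, total_seconds)
        actual = min(seek + transcribed_until, window_end)
        guess = min(seek + predicted_until, window_end)
        if guess > actual:
            # The next window starts too late, so this audio is lost.
            skipped.append((actual, guess))
        seek = guess  # jump forward to the guessed position
    return skipped


# In the second window the text only reaches 12 seconds in,
# but the guess is that the whole 30-second window was covered.
print(skipped_spans(90, [(28, 28), (12, 30), (25, 25)]))  # [(40.0, 58.0)]
```

In this example a single wrong guess drops 18 seconds of audio, and nothing later in the process goes back to recover it.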

The errors above account for a large part of why OpenAI's Whisper performs so much worse than the other models. If you exclude all hallucinations and missed sentences, Whisper becomes almost as good as Google's and Microsoft's best models.

We at Klang.ai are proud to offer the best transcription for interviews in Swedish, and you can try it for free.