All it takes is 5 seconds - Challenges with AI and Text-to-Speech

Published: 2023-09-20

Niklas Silfverström, Co-founder

Did you know that all it takes to replicate your voice is five seconds of audio?

Text-to-speech is a field within artificial intelligence (AI) that deals with converting text into human-like speech. Traditionally, the voices produced have been stiff and robotic, but in recent years the technology has come a long way, making it possible to recreate a specific person's voice when a significant amount of audio data is available.

For example, just a few weeks ago, a video surfaced in which Hillary Clinton appeared to voice her support for the Republican candidate Ron DeSantis. This is a so-called 'deepfake', a fabricated video and audio file. It is very difficult to determine its authenticity from the video alone.

The development has now advanced to the point where, with the most modern technology, only a few seconds of audio are required to replicate a voice. One of the more well-known research projects in this area is VALL-E, developed by Microsoft's research division.

Remarkably, the technology can not only mimic your voice from a few seconds of audio but also replicate the emotion in it. If you provide the AI with five seconds of excited speech, it will reproduce this enthusiasm in the generated audio. This is a significant step forward for a technology that has long felt mechanical and devoid of emotion.

VALL-E is licensed for research use only, but it didn't take long after its release for several open-source projects to emerge, aiming to replicate its results.

Challenges with new technology

There are numerous exciting possibilities with this technology, which we are exploring at Klang.ai (more on this in future blog posts). However, with opportunities come challenges. The technology has already been used in sophisticated fraud and disinformation attempts:

In Dubai, an employee received a phone call that appeared to come from the CEO of the parent company, requesting that a large sum of money be transferred.
During the early stages of Russia's invasion of Ukraine, recordings surfaced that appeared to show Volodymyr Zelensky surrendering. These, too, were fabricated to spread false propaganda.
Several elderly individuals have received calls that seemed to come from younger relatives asking for money, all simulated using advanced voice synthesis.

New routines and higher data security requirements

Organizations need to provide training that raises awareness of new types of scams. Everyone in the organization needs to know that a call can be part of a sophisticated fraud attempt, even if the call itself appears harmless. Also, make sure to use services with a strong focus on security, from companies you can confirm are reputable.

Here are a few tips to reduce the risk of falling victim to fraud:

Do not share accounts or passwords with anyone over the phone.
Use two-factor authentication to significantly reduce the risk of account hijacking.
Call the person back on a number you already know to confirm their identity.
Ask your providers about where and how your data is stored.
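
To make the two-factor tip above more concrete, here is a minimal sketch in Python, assuming the second factor is a time-based one-time password (TOTP, as described in RFC 6238) of the kind authenticator apps generate. The function names and the demo secret are illustrative assumptions, and a real deployment should rely on a vetted library rather than hand-written code.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    # Sign the 8-byte big-endian counter with the shared secret.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None, step: int = 30) -> str:
    """Time-based one-time password: HOTP over the current time window."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step))


def verify(secret: bytes, submitted: str, at: Optional[float] = None) -> bool:
    """Compare the submitted code with the expected one in constant time."""
    return hmac.compare_digest(totp(secret, at), submitted)


if __name__ == "__main__":
    demo_secret = b"12345678901234567890"  # demo value only, never hard-code secrets
    print("Current code:", totp(demo_secret))
```

Because the code changes every 30 seconds and depends on a secret that is never spoken aloud, a cloned voice alone is not enough to pass this check, as long as nobody reads the code out over the phone.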

What does the future hold?

AI-driven text-to-speech is becoming more accessible every day, and legislators are trying to establish frameworks and regulations for new AI technologies. The EU, for example, is working on legislation called the 'Artificial Intelligence Act', which aims to impose stricter transparency and risk-assessment requirements on AI models. One challenge is that regulation is not always effective against actors who use new technology for fraud and disinformation. I therefore believe we will see new tools and techniques for verifying the authenticity of voices and for analyzing audio in conversations.

Developing software that automatically identifies fake audio will be a race between different actors. I believe we will need to automatically analyze voice and text from calls to identify suspicious conversations and provide support to organizations and users to detect fraud attempts early.
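
As a rough illustration of what analyzing the text of a call could look like, the sketch below scores a transcript against a few warning signs. It is a deliberately simple keyword heuristic, not a description of how Klang.ai or any real product works: the categories, patterns, weights, and threshold are all assumptions made up for this example, and it does not examine the audio itself.

```python
import re

# Hypothetical warning signs. The patterns and weights are illustrative
# assumptions, not values taken from any real fraud-detection system.
RISK_PATTERNS = {
    "urgency": (r"\b(urgent|urgently|immediately|right now|asap)\b", 2),
    "payment": (r"\b(wire|transfer|payment|gift cards?|bitcoin)\b", 3),
    "secrecy": (r"\b(confidential|secret|do not tell|don't tell)\b", 3),
    "credentials": (r"\b(password|verification code|pin code)\b", 3),
}


def screen_transcript(transcript, threshold=5):
    """Return (score, reasons, suspicious) for a call transcript."""
    score = 0
    reasons = []
    for name, (pattern, weight) in RISK_PATTERNS.items():
        # Each category counts once, however many times it matches.
        if re.search(pattern, transcript, flags=re.IGNORECASE):
            score += weight
            reasons.append(name)
    return score, reasons, score >= threshold


if __name__ == "__main__":
    call = ("Hi, it's the CEO. I need you to transfer 200,000 euros "
            "immediately. Keep this confidential.")
    print(screen_transcript(call))
```

A production system would more likely combine a trained classifier with audio-level analysis, but even a simple score like this shows how a suspicious call could be flagged for a human to review before any money is moved.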

But ultimately, the best defense against fraud is a vigilant organization, and here too I believe we will see AI-based tools that can help.

What challenges do you see with AI and audio technology? Feel free to share your thoughts with [email protected].