Several wealthy Italian businessmen received a surprising phone call earlier this year. The speaker, who sounded just like Defence Minister Guido Crosetto, had a special request: Please send money to help us free kidnapped Italian journalists in the Middle East.

But it was not Crosetto on the other end of the line. He only learned about the calls when several of the targeted businessmen contacted him about them. It eventually transpired that fraudsters had used artificial intelligence (AI) to fake Crosetto’s voice.

Advances in AI technology mean it is now possible to generate ultra-realistic voice-overs and soundbites. Indeed, new research has found that AI-generated voices are now indistinguishable from real human voices. In this explainer, we unpack what the implications could be.

What happened in the Crosetto case?

Several Italian entrepreneurs and businessmen received calls at the start of February, one month after Prime Minister Giorgia Meloni had secured the release of Italian journalist Cecilia Sala, who had been imprisoned in Iran.

In the calls, the “deepfake” voice of Crosetto asked the businessmen to wire around one million euros ($1.17m) to an overseas bank account, the details of which were provided during the call or in other calls purporting to be from members of Crosetto’s staff.

On February 6, Crosetto posted on X, saying he had received a call on February 4 from “a friend, a prominent entrepreneur”. That friend asked Crosetto if his office had called to ask for his mobile number. Crosetto said it had not. “I tell him it was absurd, as I already had it, and that it was impossible,” he wrote in his X post.

Crosetto added that he was later contacted by another businessman who had made a large bank transfer following a call from a “General” who provided bank account information.

“He calls me and tells me that he was contacted by me and then by a General, and that he had made a very large bank transfer to an account provided by the ‘General’. I tell him it’s a scam and inform the carabinieri [Italian police], who go to his house and take his complaint.”

Similar calls from fake Ministry of Defence officials were also made to other entrepreneurs, asking for personal information and money.

While he has reported all this to the police, Crosetto added: “I prefer to make the facts public so that no one runs the risk of falling into the trap.”

Some of Italy’s most prominent business figures, such as fashion designer Giorgio Armani and Prada co-founder Patrizio Bertelli, were targeted in the scam. But, according to the authorities, only Massimo Moratti, the former owner of Inter Milan football club, actually sent the requested money. Police were able to trace and freeze the funds he transferred.

Moratti has since filed a legal complaint with the Milan prosecutor’s office. He told Italian media: “I filed the complaint, of course, but I’d prefer not to talk about it and see how the investigation goes. It all seemed real. They were good. It could happen to anyone.”

How does AI voice generation work?

AI voice generators typically use “deep learning” algorithms, in which an AI program is trained on large datasets of real human voices and “learns” the pitch, enunciation, intonation and other elements of a voice.

The program is trained on several audio clips of the same person and is “taught” to mimic that specific person’s voice, accent and style of speaking. The generated audio is known as an AI-generated voice clone.
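
To make this concrete, here is a minimal sketch of few-shot voice cloning using the open-source Coqui TTS library and its XTTS v2 model, one of several freely available tools of this kind. The file names are placeholders, and this is an illustration of the general technique, not of the tools used in any particular scam.

```python
# Minimal voice-cloning sketch using the open-source Coqui TTS library.
# Assumes `pip install TTS`; reference_sample.wav is a placeholder for a
# short recording (a few seconds) of the speaker being cloned.
from TTS.api import TTS

# Load a multilingual model that supports few-shot voice cloning.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesise new speech in the reference speaker's voice.
tts.tts_to_file(
    text="This sentence was never spoken by the real speaker.",
    speaker_wav="reference_sample.wav",
    language="en",
    file_path="cloned_output.wav",
)
```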

Using natural language processing (NLP) techniques, which enable software to understand, interpret and generate human language, AI can even learn to reproduce tonal features of a voice, such as sarcasm or curiosity.

These programs convert text into phonetic components and then generate a synthetic voice clip that sounds like a real human. The result is known as a “deepfake”, a term that combines “deep learning” and “fake” and that emerged in 2017, referring to highly realistic AI-generated images, videos or audio. The deep learning techniques behind such media build in part on generative adversarial networks, introduced by AI researcher Ian Goodfellow in 2014.
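
The text-to-phoneme step can be illustrated with the open-source phonemizer library, which converts written text into the phonetic symbols that a speech synthesiser turns into audio. This is a sketch of just that one stage of the pipeline, and it assumes the espeak backend is installed.

```python
# Sketch of the text-to-phoneme stage of a speech-synthesis pipeline,
# using the open-source phonemizer library (needs the espeak backend).
from phonemizer import phonemize

text = "Please wire the funds to this account."

# Convert text into IPA phonemes, the intermediate representation that
# an acoustic model and vocoder then render as a waveform.
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)
```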

How good are they at impersonating someone?

Research conducted by a team at Queen Mary University of London and published in the science journal PLOS One on September 24 concluded that AI-generated voices do sound like real human voices to the people listening to them.

To conduct the research, the team used a tool called ElevenLabs to generate 40 AI voice samples, some cloned from real people’s voices and some created as entirely new voices. The researchers also collected 40 recordings of real people’s actual voices. All 80 clips were edited and cleaned for quality.

The research team used male and female voices with British, American, Australian and Indian accents in the samples. ElevenLabs offers an “African” accent as well, but the researchers found that the accent label was “too general for our purposes”.

The team recruited 50 participants aged 18-65 in the United Kingdom for the tests. They were asked to listen to the recordings to try to distinguish between the AI voices and the real human voices. They were also asked which voices sounded more trustworthy.

The study found that while the “new” voices generated entirely by AI were less convincing to the participants, the deepfake voice clones were judged to be about as realistic as the real human voices.

Forty-one percent of AI-generated voices and 58 percent of voice clones were mistaken for real human voices.
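
As a purely hypothetical illustration of what those percentages mean: each clip is judged by many participants, and the “mistaken for human” rate is the share of “human” verdicts among all judgements of AI clips. The counts below are invented for demonstration and are not the study’s data.

```python
# Hypothetical counts showing how such misidentification rates could be
# computed; the numbers below are invented, not the study's data.
judgements = {
    "fully synthetic voice": {"judged_human": 410, "total": 1000},  # ~41%
    "voice clone": {"judged_human": 580, "total": 1000},  # ~58%
}

for kind, j in judgements.items():
    rate = j["judged_human"] / j["total"]
    print(f"{kind}: {rate:.0%} of judgements mistook it for a real voice")
```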

Additionally, the participants were more likely to rate British-accented voices as human than American-accented ones, another sign of how sophisticated the AI voices have become.

More worryingly, the participants tended to rate the AI-generated voices as more trustworthy than the real human voices. This contrasts with previous research, which usually found AI voices less trustworthy, and signals again how sophisticated AI has become at generating fake voices.

Should we all be very worried about this?

While AI-generated audio that sounds convincingly “human” can be useful in industries such as advertising and film editing, it can also be misused for scams and to generate fake news.

Scams similar to the one that targeted the Italian businessmen are already on the rise. In the United States, there have been reports of people receiving calls featuring deepfake voices of their relatives saying they are in trouble and requesting money.

Between January and June this year, people around the world lost more than $547.2m to deepfake scams, according to data from the California-headquartered AI company Resemble AI. The losses showed a clear upward trend, rising from just over $200m in the first quarter to $347m in the second.

Can video be deepfaked as well?

Alarmingly, yes. AI programs can be used to generate deepfake videos of real people. Combined with AI-generated audio, this means video clips of people doing and saying things they never did can be faked very convincingly.

Furthermore, it is becoming increasingly difficult to distinguish which videos on the internet are real and which are fake.

DeepMedia, a company working on tools to detect synthetic media, estimates that around eight million deepfakes will have been created and shared online by the end of 2025, a huge increase from the roughly 500,000 shared online in 2023.

What else are deepfakes being used for?

Besides phone-call fraud and fake news, AI deepfakes have been used to create sexual content about real people without their consent. Most worryingly, Resemble AI’s report, which was released in July, found that advances in AI have enabled the industrialised production of AI-generated child sexual abuse material, which has overwhelmed law enforcement globally.

In May this year, US President Donald Trump signed a bill making it a federal crime to publish intimate images of a person without their consent. This includes AI-generated deepfakes. Last month, the Australian government also announced that it would ban an application used to create deepfake nude images.