Tag Archives: Audio

Banks in the U.S. and Europe tout voice ID as a secure way to log into your account. I proved it’s possible to trick such systems with free or cheap AI-generated voices.
The bank thought it was talking to me; the AI-generated voice certainly sounded like me.
On Wednesday, I phoned my bank’s automated service line. To start, the bank asked me to say in my own words why I was calling. Rather than speak out loud, I clicked a file on my nearby laptop to play a sound clip: “check my balance,” my voice said. But this wasn’t actually my voice. It was a synthetic clone I had made using readily available artificial intelligence technology.
“Okay,” the bank replied. It then asked me to enter or say my date of birth as the first piece of authentication. After typing that in, the bank said “please say, ‘my voice is my password.’”
Again, I played a sound file from my computer. “My voice is my password,” the voice said. The bank’s security system spent a few seconds authenticating the voice.
“Thank you,” the bank said. I was in. Read More
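What was the automated system likely doing with that clip? Modern voice ID typically condenses an utterance into a fixed-length speaker embedding and compares it against an enrolled voiceprint. Below is a minimal sketch of that comparison using the open-source Resemblyzer library; the file names and acceptance threshold are illustrative assumptions, not any bank’s actual pipeline.

```python
# A minimal sketch of speaker verification by embedding similarity, using
# the open-source Resemblyzer library. File names and the threshold are
# illustrative assumptions, not any bank's actual system.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # loads a pretrained speaker-embedding model

# Hypothetical recordings: the enrolled voiceprint and the new attempt.
enrolled = encoder.embed_utterance(preprocess_wav(Path("enrollment.wav")))
attempt = encoder.embed_utterance(preprocess_wav(Path("my_voice_is_my_password.wav")))

# Resemblyzer's embeddings are length-normalized, so the dot product acts
# as a cosine similarity score between the two voices.
similarity = float(np.dot(enrolled, attempt))
THRESHOLD = 0.75  # illustrative; real systems tune this on labeled trials

print(f"similarity={similarity:.3f} -> {'accept' if similarity >= THRESHOLD else 'reject'}")
```

The article’s demonstration is precisely that a good synthetic clone can push a score like this past the threshold, which is why embedding similarity on its own makes for weak authentication.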
‘Disrespectful to the Craft:’ Actors Say They’re Being Asked to Sign Away Their Voice to AI
Motherboard spoke to multiple voice actors and advocacy organizations, some of whom said contracts with language covering synthetic voices are now very common.
Voice actors are increasingly being asked to sign away the rights to their voices so clients can use artificial intelligence to generate synthetic versions that could eventually replace them, sometimes without additional compensation, according to advocacy organizations and actors who spoke to Motherboard. Those contractual obligations are just one of the many concerns actors have about the rise of voice-generating artificial intelligence, which they say threatens to push entire segments of the industry out of work.
The news highlights the impact of the burgeoning industry of artificial-intelligence-generated voices and the much lower barrier to entry for anyone to synthesize the voices of others. Read More
Whispers of A.I.’s Modular Future
ChatGPT is in the spotlight, but it’s Whisper—OpenAI’s open-source speech-transcription program—that shows us where machine learning is going.
One day in late December, I downloaded a program called Whisper.cpp onto my laptop, hoping to use it to transcribe an interview I’d done. I fed it an audio file and, every few seconds, it produced one or two lines of eerily accurate transcript, writing down exactly what had been said with a precision I’d never seen before. As the lines piled up, I could feel my computer getting hotter. This was one of the few times in recent memory that my laptop had actually computed something complicated—mostly I just use it to browse the Web, watch TV, and write. Now it was running cutting-edge A.I.
Despite being one of the more sophisticated programs ever to run on my laptop, Whisper.cpp is also one of the simplest. If you showed its source code to A.I. researchers from the early days of speech recognition, they might laugh in disbelief, or cry—it would be like revealing to a nuclear physicist that the process for achieving cold fusion can be written on a napkin. Whisper.cpp is intelligence distilled. It’s rare among modern software in that it has virtually no dependencies—in other words, it works without the help of other programs. Instead, it is ten thousand lines of stand-alone code, most of which does little more than fairly complicated arithmetic. It was written in five days by Georgi Gerganov, a Bulgarian programmer who, by his own admission, knows next to nothing about speech recognition. Gerganov adapted it from a program called Whisper, released in September by OpenAI, the same organization behind ChatGPT and DALL-E. Whisper transcribes speech in more than ninety languages. In some of them, the software is capable of superhuman performance—that is, it can actually parse what somebody’s saying better than a human can.
What’s so unusual about Whisper is that OpenAI open-sourced it, releasing not just the code but a detailed description of its architecture. They also included the all-important “model weights”: a giant file of numbers specifying the synaptic strength of every connection in the software’s neural network. In so doing, OpenAI made it possible for anyone, including an amateur like Gerganov, to modify the program. Gerganov converted Whisper to C++, a widely supported programming language, to make it easier to download and run on practically any device. This sounds like a logistical detail, but it’s actually the mark of a wider sea change. Until recently, world-beating A.I.s like Whisper were the exclusive province of the big tech firms that developed them. Read More
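If you want to try the reference implementation Gerganov ported, it is a pip install away. A minimal sketch, assuming the openai-whisper package and a hypothetical local audio file:

```python
# Minimal sketch: transcribe one file with OpenAI's open-source Whisper.
# Assumes `pip install openai-whisper`; "interview.wav" is a hypothetical file.
import whisper

model = whisper.load_model("base")          # model weights download on first use
result = model.transcribe("interview.wav")  # language is auto-detected
print(result["text"])
```

Whisper.cpp exposes the same model weights through a single stand-alone C++ binary, which is what lets it run on practically any device.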
Researchers fear Microsoft’s ‘dangerous’ new AI voice technology
According to Ars Technica, Microsoft has developed an AI system that is capable of using machine learning to accurately mimic the voice of anyone, complete with novel generated sentences, based on just three seconds of audio input.
… According to the report, Microsoft’s engineers know this technology could be dangerous in the wrong hands, used to create malicious “deepfakes.” A system that convincingly fakes people’s voices could do everything from putting fabricated racist quotes in the mouths of celebrities or politicians to discrediting a former spouse in a custody dispute. It could even be used to create virtual pornography of a person without their consent, or to commit wire fraud by impersonating a CEO and tricking a company into transferring money. Read More
Microsoft’s VALL-E can imitate any voice with just a three-second sample
Artificial intelligence can replicate any voice, including the emotions and tone of a speaker.
- Microsoft recently unveiled an AI tool called VALL-E that can create convincing replications of people’s voices.
- The tool uses just a 3-second recording as a prompt to generate new speech.
- VALL-E can replicate a speaker’s emotions and tone, which sets it apart from many other AI voice models.
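VALL-E itself has not been publicly released, so there is no API to show. As a rough stand-in, open-source voice-cloning models work from the same kind of short reference clip; here is a minimal sketch using Coqui TTS’s XTTS model, with hypothetical file names (and the obvious caveat that you should only clone voices you have permission to use):

```python
# VALL-E is not public, so this sketch uses the open-source Coqui TTS
# library as a stand-in: its XTTS model also clones a voice from a short
# reference clip. File names here are hypothetical; clone only voices
# you have consent to use.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")  # downloads weights
tts.tts_to_file(
    text="Hello, this is a demonstration of a cloned voice.",
    speaker_wav="reference_clip.wav",  # a few seconds of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```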
Why AI audiobook narrators could win over some authors and readers, despite the vocal bumps
Apple’s and Google’s turn to AI in a booming audiobook market may sound less than human and raise the ire of voice-over actors, but it has cost benefits
For the first few seconds, the narrator of Kristen Ethridge’s new romance audiobook, Shelter from the Storm, sounds like a human being. The voice is light and carefully enunciated, with the slow pacing of any audiobook narrator, as it begins: “There’s a storm coming, and her name is Hope.”
Then, something about the pacing of the words grates on the ear. It’s a little too regular, even robotic. “I know that sounds a little crazy,” the breathy voice continues, grinding out the words. “That something so destructive could be labeled with such a peaceful name.” Read More
AI-generated podcast features fake voices of Steve Jobs and Joe Rogan
The creators of podcast.ai have released a 20-minute podcast featuring artificially generated versions of Steve Jobs and Joe Rogan. The entire interview was created using AI, with the clone of Jobs discussing Eastern mysticism, Buddhism, LSD, Google, Microsoft Windows 3, and more. Read More
Google’s new AI can hear a snippet of song—and then keep on playing
The technique, called AudioLM, generates naturalistic sounds without the need for human annotation.
A new AI system can create natural-sounding speech and music after being prompted with a few seconds of audio.
AudioLM, developed by Google researchers, generates audio that fits the style of the prompt, including complex sounds like piano music or people speaking, in a way that is almost indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to auto-generate music to accompany videos. Read More
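Google has not released AudioLM’s code, but the pipeline the researchers describe (tokenize audio into discrete units, model the token sequence the way a language model treats text, then decode a sampled continuation) can be caricatured in a few lines. The sketch below is a deliberately toy version, with k-means frames and bigram counts standing in for learned tokenizers and Transformers:

```python
# Toy illustration of the AudioLM idea only: discretize audio into tokens,
# fit an autoregressive model over the tokens, then continue a prompt.
# Real systems use learned tokenizers (SoundStream/w2v-BERT) and large
# Transformers, not k-means frames and bigram counts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1. "Tokenize": slice a toy waveform into 10 ms frames, quantize with k-means.
sr = 8000
t = np.arange(sr * 2) / sr
wave = np.sin(2 * np.pi * 220 * t) * np.sin(2 * np.pi * 3 * t)  # toy signal
frames = wave.reshape(-1, 80)
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(frames)
tokens = km.predict(frames)

# 2. "Language model": add-one-smoothed bigram counts over the token stream.
counts = np.ones((16, 16))
for a, b in zip(tokens[:-1], tokens[1:]):
    counts[a, b] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# 3. "Continue the prompt": sample tokens autoregressively, decode to audio
#    by stitching together the centroid frame for each sampled token.
generated = list(tokens[:50])  # the audio prompt, as tokens
for _ in range(100):
    generated.append(rng.choice(16, p=probs[generated[-1]]))
audio_out = km.cluster_centers_[generated].reshape(-1)
print(audio_out.shape)  # prompt plus continuation, as raw samples
```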
Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction
Generative machine learning models have made convincing voice synthesis a reality. While such tools can be extremely useful in applications where people consent to their voices being cloned (e.g., patients losing the ability to speak, actors not wanting to have to redo dialog, etc.), they also allow for the creation of nonconsensual content known as deepfakes. This malicious audio is problematic not only because it can convincingly be used to impersonate arbitrary users, but because detecting deepfakes is challenging and generally requires knowledge of the specific deepfake generator. In this paper, we develop a new mechanism for detecting audio deepfakes using techniques from the field of articulatory phonetics. Specifically, we apply fluid dynamics to estimate the arrangement of the human vocal tract during speech generation and show that deepfakes often model impossible or highly unlikely anatomical arrangements. When parameterized to achieve 99.9% precision, our detection mechanism achieves a recall of 99.5%, correctly identifying all but one deepfake sample in our dataset. We then discuss the limitations of this approach, and how deepfake models fail to reproduce all aspects of speech equally. In so doing, we demonstrate that subtle, but biologically constrained aspects of how humans generate speech are not captured by current models, and can therefore act as a powerful tool to detect audio deepfakes. Read More
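The paper’s estimator is far more sophisticated, but the underlying acoustic intuition fits in a short script: formant frequencies imply a vocal-tract geometry, and implausible geometry is a red flag. A drastically simplified sketch, assuming a uniform-tube approximation and rough human-range thresholds (this is not the authors’ method):

```python
# Simplified illustration of the premise, not the paper's method: estimate
# formants via LPC, back out a vocal-tract length under a uniform-tube
# model (Fk = (2k-1) * c / (4L)), and eyeball whether the implied lengths
# are anatomically plausible. Thresholds are rough assumptions.
import numpy as np
import librosa

C = 34300.0  # speed of sound in cm/s

def formants(frame, sr, order=12):
    """Crude formant estimates from the LPC roots of one voiced frame."""
    a = librosa.lpc(frame.astype(float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return [f for f in freqs if 90 < f < sr / 2 - 50][:3]

def implied_tract_lengths_cm(formant_freqs):
    # Uniform tube closed at the glottis, open at the lips:
    # Fk = (2k-1) * C / (4L)  =>  L = (2k-1) * C / (4 * Fk)
    return [(2 * k - 1) * C / (4 * f) for k, f in enumerate(formant_freqs, start=1)]

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
frame = y[8000:8000 + 400]                       # one 25 ms frame
lengths = implied_tract_lengths_cm(formants(frame, sr))
# Adult vocal tracts run roughly 13-18 cm; wildly inconsistent or
# out-of-range estimates across formants hint at non-anatomical audio.
print([f"{length:.1f} cm" for length in lengths])
```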
Generating Animations From Audio With NVIDIA’s Deep Learning Tech
Check out a tool in beta called Omniverse Audio2Face that lets you quickly generate new animations.
In case you missed the news, NVIDIA has a tool in beta that lets you quickly and easily generate expressive facial animation from just an audio source, using the team’s deep-learning-based technology. The Audio2Face tool simplifies the animation of 3D characters for games, films, real-time digital assistants, and other projects. The toolkit lets you run the results live or bake them out. Read More