The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future letters.
Foundation models that are trained to directly input, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts to develop systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it! — Read More
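As a taste of what that looks like in practice, here is a minimal sketch of a voice-out request over the Realtime API's WebSocket interface. The endpoint, headers, and event names reflect the API's beta at the time of writing, so treat them as assumptions and check the current docs before depending on them:

```python
# A minimal sketch of a voice-out request with OpenAI's Realtime API.
# Endpoint, headers, and event names are as documented during the beta;
# verify against the current docs.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Recent websockets versions use additional_headers; older ones
    # call the same parameter extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Ask the model for a spoken response plus a text transcript.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the caller in one short sentence.",
            },
        }))
        audio = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                audio.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
        # `audio` now holds raw PCM16 samples ready for playback.
        print(f"received {len(audio)} bytes of audio")

asyncio.run(main())
```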
FCC votes to ban scam robocalls that use AI-generated voices
The Federal Communications Commission said Thursday it is immediately outlawing scam robocalls featuring fake, artificial intelligence-created voices, cracking down on so-called “deepfake” technology that experts say could undermine election security or supercharge fraud.
The unanimous FCC vote extends anti-robocall rules to cover unsolicited AI deepfake calls by recognizing those voices as “artificial” under a federal law governing telemarketing and robocalling. – Read More
Artificial intelligence model detects asymptomatic Covid-19 infections through cellphone-recorded coughs
Results might provide a convenient screening tool for people who may not suspect they are infected.
Asymptomatic people who are infected with Covid-19 exhibit, by definition, no discernible physical symptoms of the disease. They are thus less likely to seek out testing for the virus, and could unknowingly spread the infection to others.
But it seems those who are asymptomatic may not be entirely free of changes wrought by the virus. MIT researchers have now found that people who are asymptomatic may differ from healthy individuals in the way that they cough. These differences are not decipherable to the human ear. But it turns out that they can be picked up by artificial intelligence. Read More
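For a sense of how such a detector is built: the published MIT system feeds MFCC-based audio features into an ensemble of ResNet50 networks. The sketch below shows only that generic pipeline shape, with a placeholder logistic-regression head and hypothetical file names:

```python
# Generic cough-classification pipeline: MFCC features -> classifier.
# The MIT system actually uses an ensemble of ResNet50s over MFCC inputs;
# the logistic-regression head and .wav paths here are stand-ins.
import librosa  # pip install librosa scikit-learn
import numpy as np
from sklearn.linear_model import LogisticRegression

def cough_features(path: str) -> np.ndarray:
    """Load a recording and summarize it as mean/std of 13 MFCCs."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled dataset: (path, 1 = Covid-positive, 0 = negative).
dataset = [("cough_001.wav", 1), ("cough_002.wav", 0)]
X = np.stack([cough_features(p) for p, _ in dataset])
y = np.array([label for _, label in dataset])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(cough_features("new_cough.wav")[None])[:, 1])
```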
Google’s SoundFilter AI separates any sound or voice from mixed-audio recordings
Researchers at Google claim to have developed a machine learning model that can separate a sound source from noisy, single-channel audio based on only a short sample of the target source. In a paper, they say their SoundFilter system can be tuned to filter arbitrary sound sources, even those it hasn’t seen during training.
The researchers believe a noise-eliminating system like SoundFilter could be used to create a range of useful technologies. For instance, Google drew on audio from thousands of its own meetings and YouTube videos to train the noise-canceling algorithm in Google Meet. Meanwhile, a team of Carnegie Mellon researchers created a “sound-action-vision” corpus to anticipate where objects will move when subjected to physical force. Read More
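The core idea is conditioning: an encoder turns the short target clip into an embedding, which then steers the separation of the mixture. The toy PyTorch sketch below illustrates that idea with a FiLM-modulated mask over a magnitude spectrogram; SoundFilter itself works directly on waveforms, and every layer size here is an arbitrary placeholder:

```python
# Toy conditioned source separation: an encoder embeds a short clip of
# the target source, and the embedding modulates a mask applied to the
# mixture's magnitude spectrogram.
import torch
import torch.nn as nn

class ConditionedMasker(nn.Module):
    def __init__(self, n_freq: int = 257, emb_dim: int = 64):
        super().__init__()
        # Conditioning encoder: target-clip spectrogram -> fixed embedding.
        self.cond = nn.Sequential(
            nn.Conv1d(n_freq, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(128, emb_dim),
        )
        # FiLM-style modulation: embedding -> per-frequency scale and shift.
        self.film = nn.Linear(emb_dim, 2 * n_freq)
        self.mask = nn.Conv1d(n_freq, n_freq, kernel_size=3, padding=1)

    def forward(self, mixture: torch.Tensor, target_clip: torch.Tensor):
        # mixture, target_clip: (batch, n_freq, time) magnitude spectrograms
        emb = self.cond(target_clip)
        scale, shift = self.film(emb).chunk(2, dim=-1)
        h = mixture * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        return mixture * torch.sigmoid(self.mask(h))  # masked spectrogram

model = ConditionedMasker()
mix = torch.rand(1, 257, 100)  # noisy mixture
ref = torch.rand(1, 257, 40)   # short clip of the target source
print(model(mix, ref).shape)   # torch.Size([1, 257, 100])
```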
This spooky deepfake AI mimics dozens of celebs and politicians
The voice sounds oddly familiar, like I’ve heard it a thousand times before — and I have. Indeed, it sounds just like Sir David Attenborough. But it’s not him. It’s not a person at all.
It’s simply a piece of AI software called Vocodes. The tool, which I can best describe as a deepfake generator, can mimic the voices of a slew of politicians and celebrities including Donald Trump, Barack Obama, Bryan Cranston, Danny DeVito, and a dozen more. Read More
Researchers develop AI that reads lips from video footage
AI and machine learning algorithms capable of reading lips from videos aren’t anything out of the ordinary, in truth. Back in 2016, researchers from Google and the University of Oxford detailed a system that could annotate video footage with 46.8% accuracy, outperforming a professional human lip-reader’s 12.4% accuracy. But even state-of-the-art systems struggle to overcome ambiguities in lip movements, preventing their performance from surpassing that of audio-based speech recognition.
In pursuit of a more performant system, researchers at Alibaba, Zhejiang University, and the Stevens Institute of Technology devised a method dubbed Lip by Speech (LIBS), which uses features extracted from speech recognizers as complementary clues. They say it achieves industry-leading accuracy on two benchmarks, besting the baseline by margins of 7.66% and 2.75% in character error rate. Read More
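Those margins are measured in character error rate (CER): the minimum number of character edits (insertions, deletions, substitutions) needed to turn the predicted transcript into the reference, divided by the reference's length. A minimal implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein edit distance."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

print(cer("hello world", "hallo world"))  # 1 substitution / 11 chars ≈ 0.09
```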
High quality, lightweight and adaptable TTS using LPCNet
We present a lightweight, adaptable neural TTS system with high-quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic-feature prediction, and an LPCNet neural vocoder. This system can synthesize speech with close-to-natural quality while running 3 times faster than real time on a standard CPU.
The modular setup of the system allows for simple adaptation to new voices with a small amount of data. Read More
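The "LPC" in LPCNet is classical linear predictive coding: each sample is approximated as a weighted sum of the previous few samples, so the neural vocoder only has to model the small residual. A brief illustration of that decomposition, assuming a placeholder speech.wav:

```python
# Linear prediction: fit coefficients, predict each sample from the
# previous `order` samples, and inspect how small the residual is.
import librosa
import numpy as np
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000, mono=True)  # placeholder path

order = 16
a = librosa.lpc(y, order=order)  # a[0] == 1; a[1:] are predictor coefficients

# Predicted signal: y_hat[n] = -sum_{k=1..order} a[k] * y[n-k]
y_hat = lfilter(np.concatenate([[0.0], -a[1:]]), [1.0], y)

residual = y - y_hat
print("residual/signal energy:", np.sum(residual**2) / np.sum(y**2))
```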
Amazon and Leading Technology Companies Announce the Voice Interoperability Initiative
Today, Amazon (NASDAQ: AMZN) and leading technology companies announced the Voice Interoperability Initiative, a new program to ensure voice-enabled products provide customers with choice and flexibility through multiple, interoperable voice services. The initiative is built around a shared belief that voice services should work seamlessly alongside one another on a single device, and that voice-enabled products should be designed to support multiple simultaneous wake words.

More than 30 companies are supporting the effort, including global brands like Amazon, Baidu, BMW, Bose, Cerence, ecobee, Harman, Logitech, Microsoft, Salesforce, Sonos, Sound United, Sony Audio Group, Spotify and Tencent; telecommunications operators like Free, Orange, SFR and Verizon; hardware solutions providers like Amlogic, InnoMedia, Intel, MediaTek, NXP Semiconductors, Qualcomm Technologies, Inc., SGW Global and Tonly; and systems integrators like CommScope, DiscVision, Libre, Linkplay, MyBox, Sagemcom, StreamUnlimited and Sugr. Read More
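As a toy illustration of what supporting multiple simultaneous wake words implies architecturally: several detectors listen to the same audio stream, and the device routes each utterance to whichever service's wake word fired. Everything below is stubbed and hypothetical; real products run on-device keyword-spotting models:

```python
# Toy multi-wake-word routing: one detector per voice service, all
# listening at once, with the utterance dispatched to the match.
from typing import Callable, Dict

def make_keyword_detector(keyword: str) -> Callable[[str], bool]:
    # Stand-in for an on-device keyword-spotting model.
    return lambda transcript: keyword in transcript.lower()

detectors: Dict[str, Callable[[str], bool]] = {
    "alexa": make_keyword_detector("alexa"),
    "cortana": make_keyword_detector("cortana"),
}

def route(transcript: str) -> str:
    # First match wins here; a real device would arbitrate on
    # detection confidence instead.
    for service, detect in detectors.items():
        if detect(transcript):
            return f"routing to {service}"
    return "no wake word detected"

print(route("Alexa, play some music"))           # routing to alexa
print(route("Cortana, what's on my calendar?"))  # routing to cortana
```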
Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants
Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits. Service providers can thus build a very accurate profile of a user’s demographic category and personal preferences, which may compromise privacy. To address this problem, a privacy-preserving intermediate layer between users and cloud services is proposed to sanitize the voice input. It aims to maintain utility while preserving user privacy. It achieves this by collecting real-time speech data and analyzing the signal to ensure privacy protection before the data is shared with service providers. Specifically, sensitive representations are extracted from the raw signal using transformation functions and then wrapped via voice conversion technology. Experimental evaluation based on emotion recognition shows that identification of the speaker’s sensitive emotional state is reduced by approximately 96%. Read More
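One simple instance of the sanitize-before-upload idea is pitch normalization, since pitch carries both identity and emotion cues. The sketch below shifts a recording toward a fixed target pitch before it would leave the device; the paper's actual method uses learned transformations plus voice conversion, and the target value and file paths here are placeholders:

```python
# Sanitize a recording before upload by normalizing its pitch, flattening
# one prosodic cue to emotion and identity while keeping the words intact.
import librosa
import numpy as np
import soundfile as sf  # pip install soundfile

TARGET_HZ = 150.0  # arbitrary neutral target pitch

y, sr = librosa.load("utterance.wav", sr=16000, mono=True)  # placeholder path

# Median fundamental frequency over the voiced frames.
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
median_f0 = np.nanmedian(f0[voiced])

# Shift pitch toward the fixed target, measured in semitones.
n_steps = 12 * np.log2(TARGET_HZ / median_f0)
sanitized = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

sf.write("sanitized.wav", sanitized, sr)  # this is what would be uploaded
```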