Google’s new AI can hear a snippet of song—and then keep on playing

The technique, called AudioLM, generates naturalistic sounds without the need for human annotation.

A new AI system can create natural-sounding speech and music after being prompted with a few seconds of audio.

AudioLM, developed by Google researchers, generates audio that fits the style of the prompt, including complex sounds such as piano music or human speech, in a way that is almost indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to auto-generate music to accompany videos. Read More

#audio

Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction

Generative machine learning models have made convincing voice synthesis a reality. While such tools can be extremely useful in applications where people consent to their voices being cloned (e.g., patients losing the ability to speak, or actors not wanting to redo dialogue), they also allow for the creation of nonconsensual content known as deepfakes. This malicious audio is problematic not only because it can convincingly be used to impersonate arbitrary users, but because detecting deepfakes is challenging and generally requires knowledge of the specific deepfake generator. In this paper, we develop a new mechanism for detecting audio deepfakes using techniques from the field of articulatory phonetics. Specifically, we apply fluid dynamics to estimate the arrangement of the human vocal tract during speech generation and show that deepfakes often model impossible or highly unlikely anatomical arrangements. When parameterized to achieve 99.9% precision, our detection mechanism achieves a recall of 99.5%, correctly identifying all but one deepfake sample in our dataset. We then discuss the limitations of this approach, and how deepfake models fail to reproduce all aspects of speech equally. In so doing, we demonstrate that subtle but biologically constrained aspects of how humans generate speech are not captured by current models, and can therefore act as a powerful tool to detect audio deepfakes. Read More
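The core idea of the paper — flag audio whose estimated vocal-tract geometry falls outside biologically possible ranges — can be sketched roughly as below. Everything here is a hypothetical illustration: the estimator, the plausibility bounds, and the threshold are placeholder assumptions, not the paper's actual fluid-dynamics model.

```python
# Illustrative sketch only: the feature extractor and plausibility range
# below are assumptions, not the paper's real implementation.

# Hypothetical biologically plausible bounds (cm^2) for vocal-tract
# cross-sectional area estimates.
PLAUSIBLE_AREA = (0.1, 20.0)

def estimate_tract_areas(audio_frames):
    """Stand-in for the paper's fluid-dynamics estimator: map each
    audio frame to an estimated vocal-tract cross-sectional area.
    Here we simply treat each input value as an area estimate."""
    return audio_frames

def is_deepfake(audio_frames, max_violation_rate=0.01):
    """Flag audio as fake if too many frames imply an anatomy a
    human speaker could not actually produce."""
    areas = estimate_tract_areas(audio_frames)
    violations = sum(
        1 for a in areas
        if not (PLAUSIBLE_AREA[0] <= a <= PLAUSIBLE_AREA[1])
    )
    return violations / len(areas) > max_violation_rate

# Real speech stays within plausible bounds; a synthetic sample
# may drift to impossible values.
real_sample = [1.2, 3.4, 5.0, 2.1]
fake_sample = [1.2, 45.0, -0.5, 2.1]
print(is_deepfake(real_sample))  # False
print(is_deepfake(fake_sample))  # True
```

The appeal of this style of detector, as the abstract notes, is that it does not need to know which generator produced the audio — it only checks the output against constraints every human speaker must satisfy.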

#adversarial, #audio, #fake

Generating Animations From Audio With NVIDIA’s Deep Learning Tech

Check out a tool in beta called Omniverse Audio2Face that lets you quickly generate new animations.

In case you missed the news, NVIDIA has a tool in beta that lets you quickly and easily generate expressive facial animation from just an audio source, using the team's deep learning-based technology. The Audio2Face tool simplifies the animation of 3D characters for games, films, real-time digital assistants, and other projects. The toolkit lets you run the results live or bake them out. Read More

#audio, #image-recognition