SALMONN, the First Model that Hears like Humans do

People often underestimate how much we rely on hearing to function in the world and, more importantly, how essential it is as a tool for learning.

As the famed Helen Keller once said, “Blindness cuts us off from things, but deafness cuts us off from people.” And let’s not forget that she was both blind and deaf.

Therefore, it’s only natural to see hearing as an indispensable requirement for AI to become the sought-after superior ‘being’ that some people predict.

Sadly, current AI systems suck at hearing.

… Now, a new model created by ByteDance, the company behind TikTok, challenges this view.

SALMONN is the first-ever multimodal audio-language AI system for generic hearing, a model that can process arbitrary audio inputs spanning the three main sound types: speech, audio events, and music. — Read More
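
Under the hood, systems like this follow a common recipe: an audio encoder turns the waveform into feature frames, a small connector projects those features into the language model’s embedding space, and the LLM answers a text instruction about what it “heard”. The sketch below is a generic illustration of that pipeline, not SALMONN’s actual code or API.

```python
# Generic audio-language wiring: encoder -> connector -> LLM.
# Illustrative only; module names and shapes are placeholders, not SALMONN's implementation.
import torch
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    def __init__(self, audio_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder  # waveform -> feature frames (typically frozen)
        self.connector = connector          # audio features -> LLM embedding dimension
        self.llm = llm                      # decoder-only language model (frozen or lightly tuned)

    def forward(self, waveform: torch.Tensor, instruction_embeds: torch.Tensor) -> torch.Tensor:
        audio_feats = self.audio_encoder(waveform)    # [B, T_audio, D_audio]
        audio_embeds = self.connector(audio_feats)    # [B, T_audio, D_llm]
        # Prepend the "audio tokens" to the embedded text instruction and decode as usual.
        return self.llm(torch.cat([audio_embeds, instruction_embeds], dim=1))
```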

Read the Paper

#audio

The Beatles: ‘final’ song Now and Then to be released thanks to AI technology

Now and Then, the long-awaited “final” Beatles song featuring all four members, is to be released next week thanks to the same AI technology that was used to enhance the audio on Peter Jackson’s documentary Get Back.

“There it was, John’s voice, crystal clear,” Paul McCartney said in a statement. “It’s quite emotional. And we all play on it, it’s a genuine Beatles recording. In 2023, to still be working on Beatles music, and about to release a new song the public haven’t heard, I think it’s an exciting thing.” — Read More

Video

#audio

The REAL Fight Over AI Music – Ft. CEO of Spotify and Grimes

Read More

#audio, #videos

Stability AI, gunning for a hit, launches an AI-powered music generator

… Today marks the release of Stable Audio, a tool that Stability claims is the first capable of creating “high-quality,” 44.1 kHz music for commercial use via a technique called latent diffusion. Trained on audio metadata as well as audio files’ durations and start times, Stability says that Stable Audio’s underlying, roughly 1.2-billion-parameter model affords greater control over the content and length of synthesized audio than the generative music tools released before it. — Read More
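
The notable detail is the timing conditioning: because the model saw each training clip’s start time and total duration, a user can ask for a clip of a specific length. Here is a minimal, purely illustrative sketch of that idea (not Stability’s implementation): the two timing scalars are embedded and fused with the text-prompt embedding before conditioning the diffusion model.

```python
# Minimal sketch of timing conditioning for text-to-audio latent diffusion.
# Illustrative only; dimensions and module names are assumptions, not Stability's code.
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Fuse a text-prompt embedding with (seconds_start, seconds_total)."""
    def __init__(self, text_dim: int = 768, cond_dim: int = 768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.time_proj = nn.Linear(2, cond_dim)  # embeds the two timing scalars

    def forward(self, text_emb: torch.Tensor, seconds_start: float, seconds_total: float) -> torch.Tensor:
        timing = torch.tensor([[seconds_start, seconds_total]], dtype=torch.float32)
        return self.text_proj(text_emb) + self.time_proj(timing)

# The resulting vector would condition each denoising step, giving control over clip length.
cond = TimingConditioner()(torch.randn(1, 768), seconds_start=0.0, seconds_total=45.0)
```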

#audio

AI-Generated Masterpiece: 21 Savage x Travis Scott – Whiplash by @ghostwriter

Read More

#audio

Redub Me — Speak to the world!

Dub your content into 70+ languages at the click of a button, and reach millions of new fans. — Read More

#audio

Developers are now using AI for text-to-music apps

With the rise in popularity of Large Language Models (LLMs) and generative AI tools like ChatGPT, developers have found ways to mold text for use cases ranging from writing emails to summarizing articles. Now, they are looking to help you generate bits of music by just typing some words.

Brett Bauman, the developer of PlayListAI (previously LinupSupply), launched a new app called Songburst on the App Store this week. The app doesn’t have a steep learning curve. You just have to type in a prompt like “Calming piano music to listen to while studying” or “Funky beats for a podcast intro” to let the app generate a music clip. — Read More

#audio

AudioSep — Separate Anything You Describe

Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: this https URL. — Read More
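
In practice, the interface is a single call that pairs an audio file with a free-text query describing what to pull out. The sketch below follows the project’s published example as best I recall it; the helper names (`build_audiosep`, `inference`) and the config/checkpoint paths are assumptions, so check the repository before relying on them.

```python
# Hedged sketch of language-queried separation with AudioSep.
# Function names and paths are assumptions based on the project's example usage.
import torch
from pipeline import build_audiosep, inference  # helpers from the AudioSep repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = build_audiosep(
    config_yaml="config/audiosep_base.yaml",
    checkpoint_path="checkpoint/audiosep_base.ckpt",
    device=device,
)

# Separate whatever the text query describes out of the mixture and write it to disk.
inference(model, "street_recording.wav", "a dog barking", "dog_only.wav", device)
```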

#audio

New acoustic attack steals data from keystrokes with 95% accuracy

A team of researchers from British universities has trained a deep learning model that can steal data from keyboard keystrokes recorded using a microphone with an accuracy of 95%.

When keystrokes recorded over Zoom were used to train the sound classification algorithm, the prediction accuracy dropped to 93%, which is still dangerously high, and a record for that medium.

Such an attack severely affects the target’s data security, as it could leak people’s passwords, discussions, messages, or other sensitive information to malicious third parties. — Read More
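
The recipe itself is conceptually simple: slice the recording into individual keystroke clips, convert each clip into a mel-spectrogram, and classify it as one of the keyboard’s keys. The toy sketch below illustrates that pipeline; it is not the researchers’ model, and the architecture and key set are placeholders.

```python
# Toy illustration of acoustic keystroke classification (not the paper's model):
# a single keystroke clip -> mel-spectrogram -> small CNN -> predicted key.
import torch
import torch.nn as nn
import torchaudio

N_KEYS = 36  # placeholder: e.g. letters plus digits

mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=64)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, N_KEYS),  # one logit per key
)

def predict_key(keystroke_waveform: torch.Tensor) -> int:
    """keystroke_waveform: [1, samples] clip containing a single isolated keystroke."""
    spec = mel(keystroke_waveform).unsqueeze(0)  # [1, 1, n_mels, frames]
    return classifier(spec).argmax(dim=-1).item()
```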

#audio, #surveillance

Open sourcing AudioCraft: Generative AI for audio made simple and available to all

Imagine a professional musician being able to explore new compositions without having to play a single note on an instrument. Or an indie game developer populating virtual worlds with realistic sound effects and ambient noise on a shoestring budget. Or a small business owner adding a soundtrack to their latest Instagram post with ease. That’s the promise of AudioCraft — our simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals as opposed to MIDI or piano rolls.

AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, which was trained on public sound effects, generates audio from text-based user inputs. Today, we’re excited to release an improved version of our EnCodec decoder, which allows for higher quality music generation with fewer artifacts; our pre-trained AudioGen model, which lets you generate environmental sounds and sound effects like a dog barking, cars honking, or footsteps on a wooden floor; and all of the AudioCraft model weights and code. The models are available for research purposes and to further people’s understanding of the technology. We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art. — Read More
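
Getting a first clip out of AudioCraft takes only a few lines. The snippet below follows the repository’s documented MusicGen example (a sketch; model names and defaults may have shifted since release), and AudioGen is used the same way for sound effects.

```python
# Text-to-music with AudioCraft's MusicGen, following the repo's documented example.
# Model name, duration, and write options here are illustrative choices.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate

wav = model.generate(["lo-fi beat with warm piano and vinyl crackle"])  # [B, C, T]

for idx, one_wav in enumerate(wav):
    # Writes clip_0.wav at the model's sample rate with loudness normalization.
    audio_write(f"clip_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```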

#audio