Google’s new AI can hear a snippet of song—and then keep on playing

The technique, called AudioLM, generates naturalistic sounds without the need for human annotation.

A new AI system can create natural-sounding speech and music after being prompted with a few seconds of audio.

AudioLM, developed by Google researchers, generates audio that fits the style of the prompt, including complex sounds like piano music or people speaking, in a way that is almost indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to auto-generate music to accompany videos. Read More

#audio

Generative AI: A Creative New World

Humans are good at analyzing things. Machines are even better. Machines can analyze a set of data and find patterns in it for a multitude of use cases, whether it’s fraud or spam detection, forecasting the ETA of your delivery, or predicting which TikTok video to show you next. They are getting smarter at these tasks. This is called “Analytical AI,” or traditional AI.

But humans are not only good at analyzing things—we are also good at creating. We write poetry, design products, make games and crank out code. Up until recently, machines had no chance of competing with humans at creative work—they were relegated to analysis and rote cognitive labor. But machines are just starting to get good at creating sensical and beautiful things. This new category is called “Generative AI,” meaning the machine is generating something new rather than analyzing something that already exists. 

Generative AI is well on the way to becoming not just faster and cheaper, but better in some cases than what humans create by hand. Every industry that requires humans to create original work—from social media to gaming, advertising to architecture, coding to graphic design, product design to law, marketing to sales—is up for reinvention. Certain functions may be completely replaced by generative AI, while others are more likely to thrive from a tight iterative creative cycle between human and machine—but generative AI should unlock better, faster and cheaper creation across a wide range of end markets. The dream is that generative AI brings the marginal cost of creation and knowledge work down towards zero, generating vast labor productivity and economic value—and commensurate market cap. Read More

#image-recognition, #nlp, #strategy

Policing in the metaverse: what law enforcement needs to know

The metaverse has been described as the next iteration of the internet. This report provides a first, law-enforcement-centric look at current developments on the topic, their potential implications for law enforcement, and key recommendations on how the law enforcement community can prepare for the future. It aims to help police chiefs, law enforcement agencies and policymakers begin to grasp this new environment so that they can adapt and prepare for policing in the metaverse.

This is the latest report produced by the Observatory Function of the Europol Innovation Lab. The Observatory Function monitors technological developments that are relevant for law enforcement and reports on the risks, threats and opportunities of these emerging technologies. Read More

#metaverse, #surveillance

AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability

Yesterday, Meta’s AI Research Team announced Make-A-Video, a “state-of-the-art AI system that generates videos from text.”

As he did for the Stable Diffusion data, Simon Willison created a Datasette browser to explore WebVid-10M, one of the two datasets used to train the video generation model, and quickly learned that all 10.7 million video clips were scraped from Shutterstock, watermarks and all.
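For readers who want to run the same kind of provenance check on a dataset themselves, here is a minimal sketch of the approach: load the dataset’s metadata CSV into SQLite with the sqlite-utils library and tally the source domains. The file name “results_10M.csv” and the “contentUrl” column are hypothetical placeholders for however the dataset you are inspecting is actually distributed.

```python
import csv
from urllib.parse import urlparse

import sqlite_utils  # pip install sqlite-utils datasette

# Load the dataset's metadata into a local SQLite database.
# "results_10M.csv" is a placeholder for the dataset's metadata file.
db = sqlite_utils.Database("webvid.db")
with open("results_10M.csv", newline="") as f:
    db["clips"].insert_all(csv.DictReader(f), alter=True)

# Provenance check: count clips per source domain, assuming each row
# has a "contentUrl" column (hypothetical name) pointing at the video.
counts = {}
for row in db["clips"].rows:
    domain = urlparse(row.get("contentUrl") or "").netloc
    counts[domain] = counts.get(domain, 0) + 1
print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])
```

Running `datasette webvid.db` afterwards serves a browsable web UI over the same table, roughly the kind of browser Willison published.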

In addition to the Shutterstock clips, Meta also used 10 million video clips from this 100M video dataset from Microsoft Research Asia. It’s not mentioned on their GitHub, but if you dig into the paper, you learn that the clips were pulled from over 3 million YouTube videos.

So, in addition to a massive chunk of Shutterstock’s video collection, Meta is also using millions of YouTube videos collected by Microsoft to make its text-to-video AI. Read More

#ethics, #image-recognition, #nlp

An Open Letter to the Robotics Industry and our Communities,

General Purpose Robots Should Not Be Weaponized

We are some of the world’s leading companies dedicated to introducing new generations of advanced mobile robotics to society. These new generations of robots are more accessible, easier to operate, more autonomous, affordable, and adaptable than previous generations, and capable of navigating into locations previously inaccessible to automated or remotely-controlled technologies. We believe that advanced mobile robots will provide great benefit to society as co-workers in industry and companions in our homes.

…We pledge that we will not weaponize our advanced-mobility general-purpose robots or the software we develop that enables advanced robotics and we will not support others to do so. When possible, we will carefully review our customers’ intended applications to avoid potential weaponization. We also pledge to explore the development of technological features that could mitigate or reduce these risks. To be clear, we are not taking issue with existing technologies that nations and their government agencies use to defend themselves and uphold their laws. Read More

#ethics, #robotics

OpenAI Whisper Holds the Key to GPT-4

And 8 key features that make it the best ASR model (hey Siri, this one’s for you)

The following is a selection from The Algorithmic Bridge, an educational newsletter whose purpose is to bridge the gap between algorithms and people. It will help you understand the impact AI has in your life and develop the tools to better navigate the future.

Today I’m covering a subfield of AI I never thought I’d be writing about — mostly because it’s much more mature than the ones I usually write about (large language models, AI art) and no breakthroughs were in sight. But I was mistaken. I’m referring to automatic speech recognition (ASR), as you surely have inferred from the headline… Bear with me because there are good reasons to read this news.

OpenAI announced Whisper a couple of days ago, and people are already going crazy. Not because it’s a new concept or because of improvements in algorithm design. No, the reason is simpler: Whisper works better than any other commercial ASR system. Alexa, Siri, Google Assistant (these are the ones you’re probably familiar with), any of them will feel like last-century tech after you try Whisper. And you can. OpenAI, the company that tends not to do justice to its name, decided to open-source this model. The digital experience is going to radically change for many people. Read More
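And trying it yourself really does take only a few lines of Python. A minimal sketch using the released whisper package (the checkpoint name and audio path below are placeholders):

```python
import whisper  # pip install -U openai-whisper; also requires ffmpeg

# Load one of the released checkpoints. "base" is small and fast;
# "small", "medium", and "large" trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file ("audio.mp3" is a placeholder path).
# Whisper detects the spoken language automatically by default.
result = model.transcribe("audio.mp3")
print(result["text"])
```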

#nlp

Generative Spoken Dialogue Language Modeling

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. It is able to generate speech, laughter and other paralinguistic signals in the two channels simultaneously and reproduces naturalistic turn taking. Generation samples can be found at: https://speechbot.github.io/dgslm. Read More
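The dual-tower design is easy to picture: each speaker channel gets its own transformer stream over its discrete units, and at every layer a stream attends both to its own history (self-attention) and to the other stream’s states (cross-attention), which is what lets the model learn overlap, backchannels and turn taking. Below is a minimal PyTorch sketch of one such layer; the dimensions and wiring are illustrative, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class DualTowerLayer(nn.Module):
    """One layer of a two-tower transformer: each channel self-attends
    to its own history, then cross-attends to the other channel.
    Illustrative only; not the exact dGSLM architecture."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Self-attention over this channel's own unit sequence.
        # (Causal masking, needed for generation, is omitted for brevity.)
        h = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Cross-attention: queries from this channel, keys/values from the other.
        h = self.norm2(h + self.cross_attn(h, other, other, need_weights=False)[0])
        return self.norm3(h + self.ff(h))

# Two channels of discrete-unit embeddings, shape (batch, time, d_model).
a = torch.randn(1, 100, 256)
b = torch.randn(1, 100, 256)
layer = DualTowerLayer()
a_next, b_next = layer(a, b), layer(b, a)  # one shared layer serves both towers
```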

#nlp

Cryogeomorphic Characterization of Shadowed Regions in the Artemis Exploration Zone

The Artemis program will send crew to explore the south polar region of the Moon, preceded by and integrated with robotic missions. One of the main scientific goals of future exploration is the characterization of polar volatiles, which are concentrated in and near regions of permanent shadow. The meter-scale cryogeomorphology of shadowed regions remains unknown, posing a potential risk to missions that plan to traverse or land in them. Here, we deploy a physics-based, deep learning-driven post-processing tool to produce high-signal and high-resolution Lunar Reconnaissance Orbiter Narrow Angle Camera images of 44 shadowed regions larger than ∼40 m across in the Artemis exploration zone around potential landing sites 001 and 004. We use these images to map previously unknown, shadowed meter-scale (cryo)geomorphic features, assign relative shadowed region ages, and recommend promising sites for future exploration. We freely release our data and a detailed catalog of all shadowed regions studied. Read More

#image-recognition

Robots are making French fries faster, better than humans

Fast-food French fries and onion rings are going high-tech, thanks to a company in Southern California.

Miso Robotics Inc in Pasadena has started rolling out its Flippy 2 robot, which automates the process of deep frying potatoes, onions and other foods.

A big robotic arm like those in auto plants – directed by cameras and artificial intelligence – takes frozen French fries and other foods out of a freezer, dips them into hot oil, then deposits the ready-to-serve product into a tray. Read More

#robotics

Google’s newest AI generator creates HD video from text prompts

Not to be outdone by Meta, Google has built an AI generator that can output 1280×768 HD video at 24 fps.

Today, Google announced the development of Imagen Video, a text-to-video AI model capable of producing 1280×768 videos at 24 frames per second from a written prompt. Currently, it’s in a research phase, but its appearance five months after Google Imagen points to the rapid development of video synthesis models.

Only six months after the launch of OpenAI’s DALL-E 2 text-to-image generator, progress in the field of AI diffusion models has been heating up rapidly. Google’s Imagen Video announcement comes less than a week after Meta unveiled its text-to-video AI tool, Make-A-Video.

According to Google’s research paper, Imagen Video includes several notable stylistic abilities, such as generating videos based on the work of famous painters (the paintings of Vincent van Gogh, for example), generating 3D rotating objects while preserving object structure, and rendering text in a variety of animation styles. Google is hopeful that general-purpose video synthesis models can “significantly decrease the difficulty of high-quality content generation.” Read More

#image-recognition, #nlp