We’ve trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing. Read More
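Because the models and inference code are open source, trying Whisper out can be as simple as the minimal sketch below. It assumes the released whisper Python package is installed; "audio.mp3" is a placeholder file name and "base" is just one of the several released model sizes.

```python
# Minimal sketch: transcribe speech, then translate non-English speech into
# English, using the open-source whisper package. "audio.mp3" is a placeholder.
import whisper

model = whisper.load_model("base")  # smaller checkpoint, quick to download

# Transcription in the spoken language (language is auto-detected by default).
transcription = model.transcribe("audio.mp3")
print(transcription["text"])

# Translation of non-English speech into English.
translation = model.transcribe("audio.mp3", task="translate")
print(translation["text"])
```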
State of Data Science 2022: Paving the Way for Innovation
Anaconda’s 2022 State of Data Science report is here! As in prior years, we conducted a survey to gather demographic information about our community, learn how that community works, and collect insights into the big questions and trends that are top of mind within it. As the impacts of COVID continue to linger and settle into our new normal, we decided to move away from covering COVID themes in our report and instead focus on more actionable issues within the data science, machine learning (ML), and artificial intelligence industries, such as open-source security, the talent dilemma, ethics and bias, and more. Read More
Read the Report
PP-Matting: High-Accuracy Natural Image Matting
Natural image matting is a fundamental and challenging computer vision task with many applications in image editing and composition. Recently, deep learning-based approaches have achieved great improvements in image matting. However, most of them require a user-supplied trimap as an auxiliary input, which limits real-world matting applications. Although some trimap-free approaches have been proposed, their matting quality is still unsatisfactory compared to trimap-based ones. Without trimap guidance, matting models easily suffer from foreground-background ambiguity and generate blurry details in the transition area. In this work, we propose PP-Matting, a trimap-free architecture that achieves high-accuracy natural image matting. Our method applies a high-resolution detail branch (HRDB) that extracts fine-grained details of the foreground while keeping the feature resolution unchanged. We also propose a semantic context branch (SCB) that adopts a semantic segmentation subtask, which prevents the detail prediction from suffering local ambiguity caused by missing semantic context. We conduct extensive experiments on two well-known benchmarks, Composition-1k and Distinctions-646, and the results demonstrate the superiority of PP-Matting over previous methods. Furthermore, a qualitative evaluation on human matting shows its outstanding performance in practical applications. Read More
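To make the two-branch design concrete, here is a highly simplified PyTorch sketch, illustrative only and not the authors' released implementation: a downsampled semantic branch predicts a coarse foreground/background/transition map, a full-resolution detail branch keeps fine boundaries, and a small fusion layer combines the two into an alpha matte. The layer sizes and the fusion step are assumptions made for brevity.

```python
# Simplified two-branch, trimap-free matting sketch in the spirit of the
# description above (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class TwoBranchMatting(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = conv_block(3, 32)
        # Semantic branch: downsampled, predicts a 3-class soft trimap
        # (foreground / background / transition).
        self.semantic = nn.Sequential(
            conv_block(32, 64, stride=2),
            conv_block(64, 64, stride=2),
            nn.Conv2d(64, 3, 1),
        )
        # Detail branch: keeps the input resolution for fine boundaries.
        self.detail = nn.Sequential(
            conv_block(32, 32), conv_block(32, 32), nn.Conv2d(32, 1, 1)
        )
        self.fuse = nn.Conv2d(3 + 1, 1, 1)

    def forward(self, image):
        feat = self.stem(image)
        trimap = self.semantic(feat)                      # coarse semantics
        trimap = F.interpolate(trimap, size=image.shape[-2:],
                               mode="bilinear", align_corners=False)
        detail = self.detail(feat)                        # fine-grained details
        alpha = torch.sigmoid(self.fuse(torch.cat([trimap, detail], dim=1)))
        return alpha


alpha = TwoBranchMatting()(torch.randn(1, 3, 256, 256))   # -> (1, 1, 256, 256)
```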
OpenAI begins allowing users to edit faces using DALL-E 2
After initially disabling the capability, OpenAI today announced that customers with access to DALL-E 2 can upload people’s faces to edit them using the AI-powered image-generating system. Previously, OpenAI only allowed users to work with and share photorealistic faces and banned the uploading of any photo that might depict a real person, including photos of prominent celebrities and public figures.
OpenAI claims that improvements to its safety system made the face-editing feature possible by “minimizing the potential of harm” from deepfakes as well as attempts to create sexual, political and violent content. In an email to customers, the company wrote:
Many of you have told us that you miss using DALL-E to dream up outfits and hairstyles on yourselves and edit the backgrounds of family photos. A reconstructive surgeon told us that he’d been using DALL-E to help his patients visualize results. And filmmakers have told us that they want to be able to edit images of scenes with people to help speed up their creative processes … [We] built new detection and response techniques to stop misuse. Read More
#image-recognition, #nlp
Brain Map
Functional magnetic resonance imaging (fMRI) was used to measure brain activity in seven people while they listened to more than 2 hours of stories from The Moth Radio Hour. This data was used to estimate voxel-wise models that predict brain activity in each voxel (volumetric pixel) based on the meaning of the words in the stories. Read the paper describing this research here.
This site provides an interactive 3D viewer for models fit to one subject’s brain. Read More
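The voxel-wise modeling recipe is conceptually simple, and the sketch below illustrates the general idea (not the authors' code): ridge regression from scikit-learn maps word-meaning features of the stories to the response of every voxel, and prediction quality is summarized per voxel on held-out data. The feature matrix and responses here are random placeholders standing in for real stimulus embeddings and recorded fMRI time courses.

```python
# Sketch of a voxel-wise encoding model: linear (ridge) regression from
# word-meaning features to each voxel's response. Data are random placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_timepoints, n_features, n_voxels = 1200, 300, 5000
X = rng.standard_normal((n_timepoints, n_features))   # stimulus (word-meaning) features
Y = rng.standard_normal((n_timepoints, n_voxels))     # voxel responses

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)

# One linear model per voxel, fit jointly; Ridge handles multi-output targets.
model = Ridge(alpha=100.0).fit(X_tr, Y_tr)
pred = model.predict(X_te)

# Model quality per voxel: correlation between predicted and held-out responses.
r = np.array([np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(n_voxels)])
print(r.mean())
```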
DeepMind Says It Had Nothing to Do With Research Paper Saying AI Could End Humanity
After a researcher with a position at DeepMind—the machine intelligence firm owned by Google parent Alphabet—co-authored a paper claiming that AI could feasibly wipe out humanity one day, DeepMind is distancing itself from the work.
The paper was published recently in the peer-reviewed AI Magazine and was co-authored by researchers at Oxford University and by Marcus Hutter, an AI researcher who works at DeepMind. The first line of Hutter’s website states: “I am Senior Researcher at Google DeepMind in London, and Honorary Professor in the Research School of Computer Science (RSCS) at the Australian National University (ANU) in Canberra.” The paper, which currently lists his affiliation as DeepMind and ANU, runs through some thought experiments about humanity’s future with a superintelligent AI that operates using schemes similar to today’s machine learning programs, such as reward-seeking. It concludes that this scenario could erupt into a zero-sum game between humans and AI that would be “fatal” if humanity loses. Read More
Read the Paper
D-ID, the company behind Deep Nostalgia, lets you create AI-generated videos from a single image
Israeli AI company D-ID, which provided technology for projects like Deep Nostalgia, is launching a new platform where users can upload a single image and text to generate video. With this new site, called Creative Reality Studio, the company is targeting sectors like corporate training and education, internal and external corporate communication, product marketing, and sales.
The platform is pretty simple to use: Users can upload an image of a presenter or select one from the pre-created presenters to start the video creation process. Paid users can access premium presenters who are more “expressive” as they have better facial expressions and hand movements than the default ones. After that, users can either type the text from a script or simply upload an audio clip of someone’s speech. Users can then select a language (the platform supports 119 languages), voice and styles like cheerful, sad, excited and friendly.
The company’s AI-based algorithms then generate a video from these parameters, which users can distribute anywhere. The firm claims that the algorithm needs only half the video’s duration to generate a clip, but in our tests it took a couple of minutes to generate a one-minute video; this could vary depending on the presenter and language selected. Read More
10 years later, deep learning ‘revolution’ rages on, say AI pioneers Hinton, LeCun and Li
Artificial intelligence (AI) pioneer Geoffrey Hinton, one of the trailblazers of the deep learning “revolution” that began a decade ago, says that the rapid progress in AI will continue to accelerate.
In an interview before the 10-year anniversary of key neural network research that led to a major AI breakthrough in 2012, Hinton and other leading AI luminaries fired back at some critics who say deep learning has “hit a wall.”
“We’re going to see big advances in robotics — dexterous, agile, more compliant robots that do things more efficiently and gently like we do,” Hinton said.
Other AI pathbreakers, including Yann LeCun, head of AI and chief scientist at Meta, and Stanford University professor Fei-Fei Li, agree with Hinton that the results of the groundbreaking 2012 research on the ImageNet database (which built on previous work to unlock significant advances in computer vision specifically and in deep learning overall) pushed deep learning into the mainstream and sparked a momentum that will be hard to stop. Read More
Midjourney AI Art vs. Artist – Testing AI art to see if it can replicate my artwork
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs by assuming that there is a strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project ‘WenLan’ led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest MoCo method to the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks. Read More
#multi-modal
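The queue-based contrastive idea can be sketched in a few lines of PyTorch. The snippet below illustrates MoCo-style cross-modal learning under the weak-correlation setting described above; it is not the BriVL implementation. The tiny linear encoders, feature sizes, and single (image-to-text) loss direction are simplifying assumptions, and the momentum update of the key encoder is omitted.

```python
# Sketch of queue-based cross-modal contrastive learning (MoCo adapted to
# image-text pairs): each image embedding is contrasted against its paired
# text embedding (positive) and a large queue of text embeddings from earlier
# batches (negatives). Encoders and inputs are placeholders, not BriVL's.
import torch
import torch.nn.functional as F

dim, queue_size, batch = 128, 4096, 32
image_encoder = torch.nn.Linear(2048, dim)   # stands in for an image backbone
text_encoder = torch.nn.Linear(768, dim)     # stands in for a text transformer
queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # negatives from past batches


def contrastive_step(image_feats, text_feats, temperature=0.07):
    global queue
    q = F.normalize(image_encoder(image_feats), dim=1)         # queries
    k = F.normalize(text_encoder(text_feats), dim=1).detach()  # keys (momentum side)

    pos = (q * k).sum(dim=1, keepdim=True)   # similarity to the paired text
    neg = q @ queue.t()                      # similarity to queued negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive is index 0
    loss = F.cross_entropy(logits, labels)

    # Enqueue the new keys and drop the oldest ones (FIFO dictionary).
    # Omitted for brevity: the symmetric text-to-image loss and the momentum
    # update of the key encoder used in MoCo-style training.
    queue = torch.cat([k, queue], dim=0)[:queue_size]
    return loss


loss = contrastive_step(torch.randn(batch, 2048), torch.randn(batch, 768))
print(loss.item())
```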