The recent success of self-supervised learning can be largely attributed to content-preserving transformations, which can be used to easily induce invariances. While transformations generate positive sample pairs in contrastive loss training, most recent work focuses on developing new objective formulations, and pays relatively little attention to the transformations themselves. In this paper, we introduce the framework of Generalized Data Transformations to (1) reduce several recent self-supervised learning objectives to a single formulation for ease of comparison, analysis, and extension, (2) allow a choice between being invariant or distinctive to data transformations, obtaining different supervisory signals, and (3) derive the conditions that combinations of transformations must obey in order to lead to well-posed learning objectives. This framework allows both invariance and distinctiveness to be injected into representations simultaneously, and lets us systematically explore novel contrastive objectives. We apply it to study multi-modal self-supervision for audio-visual representation learning from unlabelled videos, improving the state-of-the-art by a large margin, and even surpassing supervised pretraining. We demonstrate results on a variety of downstream video and audio classification and retrieval tasks, on datasets such as HMDB-51, UCF-101, DCASE2014, ESC-50 and VGG-Sound. In particular, we achieve new state-of-the-art accuracies of 72.8% on HMDB-51 and 95.2% on UCF-101. Read More
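The contrastive objectives the abstract refers to typically pull together embeddings of two transformed views of the same sample and push apart all other pairs. A minimal NumPy sketch of such an InfoNCE-style loss (the function name, temperature value, and batch layout are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE) loss: row i of z_a is pulled toward row i of
    z_b (a transformed view of the same sample) and pushed away from every
    other row, which acts as a negative."""
    # L2-normalise embeddings so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: view i of z_a matches view i of z_b
    return -np.mean(np.diag(log_prob))
```

Choosing which transformations connect a positive pair (invariance) versus which must separate it (distinctiveness) is exactly the design axis the Generalized Data Transformations framework makes explicit.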
#image-recognition, #self-supervised
Facebook’s next big AI project is training its machines on users’ public videos
AI that can understand video could be put to a variety of uses
Teaching AI systems to understand what’s happening in videos as completely as a human can is one of the hardest challenges — and biggest potential breakthroughs — in the world of machine learning. Today, Facebook announced a new initiative that it hopes will give it an edge in this consequential work: training its AI on Facebook users’ public videos.
Access to training data is one of the biggest competitive advantages in AI, and by collecting this resource from millions and millions of their users, tech giants like Facebook, Google, and Amazon have been able to forge ahead in various areas. And while Facebook has already trained machine vision models on billions of images collected from Instagram, it hasn’t previously announced projects of similar ambition for video understanding. Read More
Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans
This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. Experiments on a newly collected multi-view dataset show that our approach outperforms prior works by a large margin in terms of the view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset. The code and dataset will be available at https://zju3dv.github.io/neuralbody/. Read More
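The core idea of latent codes anchored to a deformable mesh can be sketched in a few lines: codes are shared across frames, only the posed mesh vertices move, so a 3D query point fetches the code of its nearest vertex. This is a deliberately simplified reading of the paper's structured latent codes (the real model uses sparse convolutions and an MLP, not a bare nearest-vertex lookup):

```python
import numpy as np

def query_latent_code(points, vertices, codes):
    """For each 3D query point, return the latent code anchored to the
    nearest vertex of the posed body mesh. Because `codes` are shared
    across frames while `vertices` deform per frame, observations from
    different video frames condition the same set of codes."""
    # pairwise squared distances between points (P, 3) and vertices (V, 3)
    d2 = ((points[:, None, :] - vertices[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)           # index of closest vertex per point
    return codes[nearest]                 # (P, C) codes for a density/colour net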
#image-recognition
Am I a Real or Fake Celebrity?
Recently, significant advancements have been made in face recognition technologies using Deep Neural Networks. As a result, companies such as Microsoft, Amazon, and Naver offer highly accurate commercial face recognition web services for diverse applications to meet the end-user needs. Naturally, however, such technologies are threatened persistently, as virtually any individual can quickly implement impersonation attacks. In particular, these attacks can be a significant threat for authentication and identification services, which heavily rely on their underlying face recognition technologies’ accuracy and robustness. Despite its gravity, the issue regarding deepfake abuse using commercial web APIs and their robustness has not yet been thoroughly investigated. This work provides a measurement study on the robustness of black-box commercial face recognition APIs against Deepfake Impersonation (DI) attacks using celebrity recognition APIs as an example case study. We use five deepfake datasets, two of which we created and plan to release. More specifically, we measure attack performance based on two scenarios (targeted and non-targeted) and further analyze the differing system behaviors using fidelity, confidence, and similarity metrics. Accordingly, we demonstrate how vulnerable face recognition technologies from popular companies are to DI attacks, achieving maximum success rates of 78.0% and 99.9% for targeted (i.e., precise match) and non-targeted (i.e., match with any celebrity) attacks, respectively. Moreover, we propose practical defense strategies to mitigate DI attacks, reducing the attack success rates to as low as 0% and 0.02% for targeted and non-targeted attacks, respectively. Read More
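The two success metrics above are simple to state in code. A hedged sketch of how such rates could be computed (the function and variable names are illustrative, not from the paper; `None` stands in for "the API recognized no celebrity"):

```python
def attack_success_rates(api_outputs, impersonated):
    """api_outputs: the celebrity label a black-box recognition API returned
    for each deepfaked input, or None if it recognized no celebrity.
    impersonated: the celebrity each input was faked to impersonate.
    Targeted success     = API names exactly the impersonated celebrity.
    Non-targeted success = API names any celebrity at all."""
    n = len(api_outputs)
    targeted = sum(out == tgt for out, tgt in zip(api_outputs, impersonated))
    non_targeted = sum(out is not None for out in api_outputs)
    return targeted / n, non_targeted / n
```

Non-targeted success is necessarily at least as high as targeted success, which matches the 99.9% vs. 78.0% gap the study reports.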
#fake, #image-recognition
Tom Cruise deepfake creator says public shouldn’t be worried about ‘one-click fakes’
Weeks of work and a top impersonator were needed to make the viral clips
When a series of spookily convincing Tom Cruise deepfakes went viral on TikTok, some suggested it was a chilling sign of things to come — a harbinger of an era where AI will let anyone make fake videos of anyone else. The video’s creator, though, Belgian VFX specialist Chris Ume, says this is far from the case. Speaking to The Verge about his viral clips, Ume stresses the amount of time and effort that went into making each deepfake, as well as the importance of working with a top-flight Tom Cruise impersonator, Miles Fisher.
“You can’t do it by just pressing a button,” says Ume. “That’s important, that’s a message I want to tell people.” Each clip took weeks of work, he says, using the open-source DeepFaceLab algorithm as well as established video editing tools. “By combining traditional CGI and VFX with deepfakes, it makes it better. I make sure you don’t see any of the glitches.” Read More
Self-supervised Pretraining of Visual Features in the Wild
Recently, self-supervised learning methods like MoCo [22], SimCLR [8], BYOL [20] and SwAV [7] have reduced the gap with supervised methods. These results have been achieved in a controlled environment, namely the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore whether self-supervision lives up to its expectations by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs, achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting. Interestingly, we also observe that self-supervised models are good few-shot learners, achieving 77.9% top-1 with access to only 10% of ImageNet. Read More
Intel and EXOS Pilot 3D Athlete Tracking with Pro Football Hopefuls
New AI ‘Deep Nostalgia’ brings old photos, including very old ones, to life
It seems like a nice idea in theory but it’s a tiny bit creepy as well
An AI-powered service called Deep Nostalgia that animates still photos has become the main character on Twitter this fine Sunday, as people try to create the creepiest fake “video” possible, apparently.
The Deep Nostalgia service, offered by online genealogy company MyHeritage, uses AI licensed from D-ID to create the effect that a still photo is moving. It’s kinda like the iOS Live Photos feature, which adds a few seconds of video to help smartphone photographers find the best shot. Read More
VinVL: Making Visual Representations Matter in Vision-Language Models
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR [21], and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public. Read More
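Feeding detector features into a Transformer fusion model amounts to building one input sequence from word embeddings, object-tag embeddings, and projected region features. A loose sketch of that assembly step (the function name and the single projection matrix `proj` are assumptions; real OSCAR-style models also add positional and box-coordinate information):

```python
import numpy as np

def assemble_vl_input(word_emb, tag_emb, region_feats, proj):
    """Concatenate word embeddings (W, d), object-tag embeddings (T, d),
    and detector region features (R, d_feat) projected to the model
    width d into a single (W + T + R, d) Transformer input sequence."""
    regions = region_feats @ proj              # (R, d_feat) -> (R, d)
    return np.concatenate([word_emb, tag_emb, regions], axis=0)
```

Under this view, VinVL's contribution is improving what fills the `region_feats` slot, while the fusion architecture is left largely as-is.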
This is how we lost control of our faces
The largest ever study of facial-recognition data shows how much the rise of deep learning has fueled a loss of privacy.
In 1964, mathematician and computer scientist Woodrow Bledsoe first attempted the task of matching suspects’ faces to mugshots. He measured out the distances between different facial features in printed photographs and fed them into a computer program. His rudimentary successes would set off decades of research into teaching machines to recognize human faces.
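Bledsoe's procedure reduces to comparing vectors of hand-measured landmark distances. A minimal sketch of that matching step (names and the choice of Euclidean distance are illustrative; his actual system involved manual normalization for pose):

```python
import numpy as np

def nearest_mugshot(probe, gallery):
    """probe: a vector of measured distances between facial landmarks
    (e.g. eye spacing, nose-to-mouth) for a suspect photo.
    gallery: one such vector per mugshot.
    Returns the index of the closest mugshot by Euclidean distance."""
    gallery = np.asarray(gallery, dtype=float)
    diffs = gallery - np.asarray(probe, dtype=float)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))
```

Modern deep-learning systems replace the hand-measured distances with learned embeddings, but the nearest-match step at the end is conceptually the same.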
Now a new study shows just how much this enterprise has eroded our privacy. It hasn’t just fueled an increasingly powerful tool of surveillance. The latest generation of deep-learning-based facial recognition has completely disrupted our norms of consent. Read More