In self-supervised learning, a system is tasked with achieving a surrogate objective by defining alternative targets on a set of unlabeled data. The aim is to build useful representations that can be used in downstream tasks, without costly manual annotation. In this work, we propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data. Training a relation head to discriminate how entities relate to themselves (intra-reasoning) and to other entities (inter-reasoning) results in rich and descriptive representations in the underlying neural network backbone, which can be used in downstream tasks such as classification and image retrieval. We evaluate the proposed method following a rigorous experimental procedure, using standard datasets, protocols, and backbones. Self-supervised relational reasoning outperforms the best competitor in all conditions by an average of 14% in accuracy, and the most recent state-of-the-art model by 3%. We link the effectiveness of the method to the maximization of a Bernoulli log-likelihood, which can be considered a proxy for maximizing the mutual information, resulting in a more efficient objective than the commonly used contrastive losses. Read More
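The pairing scheme described above can be sketched in a few lines of NumPy. This is a toy illustration only: the random features stand in for a backbone's output, the pairs are aggregated by simple concatenation, and a fixed linear head replaces the paper's learnable relation module; the negative Bernoulli log-likelihood is the standard binary cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_loss(z1, z2, labels, w):
    """Score aggregated feature pairs with a linear relation head and
    return the negative Bernoulli log-likelihood (binary cross-entropy)."""
    pair = np.concatenate([z1, z2], axis=1)      # simple pair aggregation
    logits = pair @ w
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

# Toy "backbone" features for 4 entities, with two augmented views each.
feats_a = rng.normal(size=(4, 8))
feats_b = feats_a + 0.1 * rng.normal(size=(4, 8))   # second view

# Intra-reasoning: two views of the same entity, label 1.
# Inter-reasoning: views of different entities (shifted pairing), label 0.
z1 = np.concatenate([feats_a, feats_a])
z2 = np.concatenate([feats_b, np.roll(feats_b, 1, axis=0)])
labels = np.concatenate([np.ones(4), np.zeros(4)])

w = rng.normal(size=16)                          # stand-in relation head
loss = relation_loss(z1, z2, labels, w)
```

In the actual method, minimizing this loss with respect to both the head and the backbone is what shapes the representation; here the head is fixed purely to show how the positive and negative pairs are constructed from unlabeled data alone.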
Tag Archives: Self Supervised
PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models
The primary aim of single-image super-resolution is to construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require training on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the “downscaling loss,” which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee that our outputs are realistic. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible. Read More
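The core idea — search the latent space for an image that downscales to the LR input, while constraining latents to a sphere, as high-dimensional Gaussian mass concentrates there — can be sketched with stand-ins. Everything below is illustrative: a random linear-plus-tanh map replaces the pretrained generator, average pooling replaces bicubic downscaling, and random restarts replace PULSE's actual latent-space optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W):
    """Stand-in for a pretrained generative model: latent z -> 'HR image'."""
    return np.tanh(W @ z)

def downscale(hr, factor=4):
    """Stand-in degradation operator: average-pool the HR vector."""
    return hr.reshape(-1, factor).mean(axis=1)

def downscaling_loss(z, W, lr_img):
    """How badly does the generated image downscale to the LR observation?"""
    return np.sum((downscale(generator(z, W)) - lr_img) ** 2)

# Toy setup: 16-dim "HR images", 4-dim LR observation from a hidden latent.
W = rng.normal(size=(16, 8))
lr_img = downscale(generator(rng.normal(size=8), W))

# Gradient-free search sketch: sample latents restricted to the sphere of
# radius sqrt(d), where high-dimensional Gaussian mass concentrates.
d = 8
best_z, best_loss = None, np.inf
for _ in range(2000):
    z = rng.normal(size=d)
    z *= np.sqrt(d) / np.linalg.norm(z)   # project onto the sphere
    l = downscaling_loss(z, W, lr_img)
    if l < best_loss:
        best_z, best_loss = z, l
```

In PULSE itself the search is done by gradient descent through the generator rather than random sampling, but the objective has this shape: minimize the downscaling loss subject to the latent staying in the high-density region of the prior.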
Self-supervised learning is the future of AI
Despite the huge contributions of deep learning to the field of artificial intelligence, there’s something very wrong with it: It requires huge amounts of data. This is one thing that both the pioneers and critics of deep learning agree on. In fact, deep learning didn’t emerge as the leading AI technique until a few years ago because of the limited availability of useful data and the shortage of computing power to process that data.
Reducing the data-dependency of deep learning is currently among the top priorities of AI researchers. Read More
Hand labeling is the past. The future is #NoLabel AI
Data labeling is so hot right now… but could this rapidly emerging market face disruption from a small team at Stanford and the Snorkel open source project, which enables highly efficient programmatic labeling that is 10 to 1,000x as efficient as hand labeling? Read More
Self-training with Noisy Student improves ImageNet classification
We present a simple self-training method that achieves 87.4% top-1 accuracy on ImageNet, which is 1.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 16.6% to 74.2%, reduces ImageNet-C mean corruption error from 45.7 to 31.2, and reduces ImageNet-P mean flip rate from 27.8 to 16.1.
To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images. We iterate this process by putting the student back as the teacher. During the generation of the pseudo labels, the teacher is not noised, so that the pseudo labels are as accurate as possible. But during the learning of the student, we inject noise such as data augmentation, dropout, and stochastic depth, so that the noised student is forced to learn harder from the pseudo labels. Read More
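The teacher–student loop above can be sketched end to end on toy data. A trivial threshold classifier on 1-D points stands in for EfficientNet, and additive input noise stands in for the augmentation/dropout/stochastic-depth noise injected only on the student side; the data, model, and noise are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: label = 1 if x > 0. Small labeled set, large unlabeled pool.
x_lab = rng.normal(size=20)
y_lab = (x_lab > 0).astype(int)
x_unlab = rng.normal(size=500)

def fit_threshold(x, y):
    """'Train' a trivial classifier: pick the cut that best separates y."""
    cuts = np.sort(x)
    accs = [np.mean((x > c).astype(int) == y) for c in cuts]
    return cuts[int(np.argmax(accs))]

# Step 1: train the teacher on labeled data only.
teacher = fit_threshold(x_lab, y_lab)

# Step 2: the un-noised teacher pseudo-labels the unlabeled pool,
# keeping the pseudo labels as clean as possible.
pseudo = (x_unlab > teacher).astype(int)

# Step 3: train the student on labeled + pseudo-labeled data, with
# noise injected only on the student's inputs.
x_all = np.concatenate([x_lab, x_unlab + 0.05 * rng.normal(size=x_unlab.size)])
y_all = np.concatenate([y_lab, pseudo])
student = fit_threshold(x_all, y_all)

# Step 4 (iterate): put the student back as the teacher and repeat.
```

The real method also makes the student larger than the teacher at each iteration; the toy version only shows the label flow and the asymmetry in where noise is applied.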
Superhuman AI for multiplayer poker
In recent years there have been great strides in artificial intelligence (AI), with games often serving as challenge problems, benchmarks, and milestones for progress. Poker has served for decades as such a challenge problem. Past successes in such benchmarks, including poker, have been limited to two-player games. However, poker in particular is traditionally played with more than two players. Multiplayer games present fundamental additional issues beyond those in two-player games, and multiplayer poker is a recognized AI milestone. In this paper we present Pluribus, an AI that we show is stronger than top human professionals in six-player no-limit Texas hold’em poker, the most popular form of poker played by humans. Read More
Facebook: New AI tech spots hate speech faster
Facebook’s AI engineers have embraced a technology called self-supervised learning so the social network’s technology can adapt faster to challenges like spotting new forms of hate speech.
Artificial intelligence is sweeping the tech industry, and beyond, as the new method for getting computers to recognize patterns and make decisions catches on. With today’s AI technology called deep learning, you can get a computer to recognize a cat by training it with lots of pictures of cats, instead of figuring out how to define cat characteristics like two eyes, pointy ears and whiskers.
Self-supervised learning, though, needs vastly less training data than regular AI training, which cuts the time needed to assemble training data and train a system. For example, self-supervised learning methods have cut the amount of training data needed by a factor of 10, Manohar Paluri, an AI research leader at Facebook, said Wednesday at the company’s F8 developer conference.
And that speed is critical to making Facebook fun and safe, not a cesspool of toxic comments, misinformation, abuse and scams. Read More
Self-Supervised GANs
If you aren’t familiar with Generative Adversarial Networks (GANs), they are a massively popular generative modeling technique formed by pitting two Deep Neural Networks, a generator and a discriminator, against each other. This adversarial loss has sparked the interest of many Deep Learning and Artificial Intelligence researchers. However, despite the beauty of the GAN formulation and the eye-opening results of the state-of-the-art architectures, GANs are generally very difficult to train. One of the best ways to get better results with GANs is to provide class labels. This is the basis of the conditional-GAN model. This article will show how Self-Supervised Learning can overcome the need for class labels for training GANs and rival the performance of conditional-GAN models.
Before we get into how Self-Supervised Learning improves GANs, we will introduce the concept of Self-Supervised Learning. Compared to the popular families of Supervised and Unsupervised Learning, Self-Supervised is most similar to Unsupervised Learning. Self-Supervised tasks include things such as image colorization, predicting the relative location of extracted patches from an image, or in this case, predicting the rotation angle of an image. These tasks are dubbed “Self-Supervised” because the data lends itself to these surrogate tasks. In this sense, the Self-Supervised tasks take the form of (X, Y) pairs; however, the (X, Y) pairs are automatically constructed from the dataset itself and do not require human labeling. The paper discussed in this article summarizes Self-Supervised Learning as “one can make edits to the given image and ask the network to predict the edited part”. This is the basic idea behind Self-Supervised Learning. Read More
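The rotation pretext task mentioned above makes the "automatic (X, Y) pairs" idea concrete: every unlabeled image yields four labeled examples for free. A minimal sketch, using random arrays in place of real images:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_pairs(images):
    """Build (X, y) pairs automatically from unlabeled images: each image
    is rotated by 0/90/180/270 degrees and labeled with its rotation index.
    No human labeling is involved -- the labels come from the edit itself."""
    xs, ys = [], []
    for img in images:
        for k in range(4):                # k quarter-turns
            xs.append(np.rot90(img, k))
            ys.append(k)
    return np.stack(xs), np.array(ys)

imgs = rng.normal(size=(8, 32, 32))       # toy unlabeled "images"
X, y = make_rotation_pairs(imgs)          # 8 images -> 32 labeled examples
```

A network trained to predict `y` from `X` must learn enough about object structure to tell "up" from "sideways" — and that learned representation is what the self-supervised GAN discriminator reuses in place of class labels.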
Self-supervised learning: (Auto)encoder networks
Network must copy inputs to outputs through a “bottleneck” (fewer hidden units)
Hidden representations become a learned compressed code of the inputs/outputs
Capture systematic structure among the full set of patterns
Due to the bottleneck, the network doesn't have the capacity to overlearn idiosyncratic aspects of particular patterns
For N linear hidden units, hidden representations span the same subspace as the first N principal components (≈PCA)
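The points above can be demonstrated numerically: train a linear autoencoder with a 2-unit bottleneck by gradient descent and compare the subspace it reconstructs into against the first two principal components. The toy data, learning rate, and iteration count are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 centered samples in 5-D, most variance in the first 2 axes.
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])
X -= X.mean(axis=0)

# Linear autoencoder 5 -> 2 -> 5, trained by gradient descent on the
# squared reconstruction error ||X - X @ W_enc @ W_dec||^2.
W_enc = 0.1 * rng.normal(size=(5, 2))
W_dec = 0.1 * rng.normal(size=(2, 5))
lr = 0.02
for _ in range(5000):
    H = X @ W_enc                       # bottleneck code (the "compressed" input)
    E = H @ W_dec - X                   # reconstruction error
    W_dec -= lr * (H.T @ E) / len(X)
    W_enc -= lr * (X.T @ (E @ W_dec.T)) / len(X)

# First 2 principal components via SVD, and the projector onto their span.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:2].T @ Vt[:2]

# Projector onto the subspace the decoder reconstructs into.
Q, _ = np.linalg.qr(W_dec.T)
P_ae = Q @ Q.T

err_ae = np.sum((X @ W_enc @ W_dec - X) ** 2)
err_pca = np.sum((X @ P_pca - X) ** 2)   # the rank-2 optimum (Eckart-Young)
```

After training, the autoencoder's reconstruction error approaches the PCA optimum and its hidden subspace aligns with the span of the first two principal components — the hidden units find the same subspace as PCA, though not necessarily the individual components themselves.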
Read More
Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection
Self-supervision can dramatically cut back the amount of manually-labeled data required to train deep neural networks. While self-supervision has usually been considered for tasks such as image classification, in this paper we aim at extending it to geometry-oriented tasks such as semantic matching and part detection. We do so by building on several recent ideas in unsupervised landmark detection. Our approach learns dense distinctive visual descriptors from an unlabeled dataset of images using synthetic image transformations. It does so by means of a robust probabilistic formulation that can introspectively determine which image regions are likely to result in stable image matching. We show empirically that a network pretrained in this manner requires significantly less supervision to learn semantic object parts compared to numerous pretraining alternatives. We also show that the pretrained representation is excellent for semantic object matching. Read More
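One common way to realize this kind of "probabilistic introspection" is a heteroscedastic matching loss, where the network also predicts a per-pixel uncertainty that down-weights unreliable regions. The sketch below uses that generic Gaussian form as a stand-in for the paper's own robust formulation (which it may not match exactly); the descriptors, image sizes, and "unstable corner" are all toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def introspective_match_loss(desc_a, desc_b, log_sigma):
    """Heteroscedastic matching loss: squared descriptor distance between
    corresponding pixels, attenuated where predicted uncertainty (sigma)
    is high, with a log-sigma penalty so the model can't ignore everything."""
    sq = np.sum((desc_a - desc_b) ** 2, axis=-1)
    return np.mean(sq / (2 * np.exp(2 * log_sigma)) + log_sigma)

# Toy dense descriptors for an image and its synthetic transformation:
# most regions agree; one corner (e.g. an occlusion) does not.
desc_a = rng.normal(size=(16, 16, 8))
desc_b = desc_a.copy()
desc_b[:4, :4] += rng.normal(size=(4, 4, 8))   # unreliable corner

# Overconfident model: uniform (low) uncertainty everywhere.
log_sigma = np.zeros((16, 16))
overconfident = introspective_match_loss(desc_a, desc_b, log_sigma)

# Introspective model: flags the unstable corner with high uncertainty.
log_sigma[:4, :4] = 2.0
introspective = introspective_match_loss(desc_a, desc_b, log_sigma)
```

Flagging the unstable region lowers the total loss, which is exactly the incentive that lets the network learn, without labels, which image regions yield stable matches.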