A few days ago OpenAI released two impressive models, CLIP and DALL-E. While DALL-E generates images from text, CLIP classifies a very wide range of images by turning image classification into a text similarity problem. Current image classification networks are trained on a fixed number of categories. CLIP doesn’t work this way: it learns directly from raw text about images, so it isn’t limited by labels and supervision. This is quite impressive: CLIP can classify images with state-of-the-art accuracy without any dataset-specific training. Read More
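To make the "classification as text similarity" idea concrete, here is a minimal sketch in plain Python. It does not call the real CLIP model: the three-dimensional embeddings are hand-made toy values, and the function names and the temperature of 0.07 are assumptions chosen for illustration. In practice the embeddings would come from trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Score one image embedding against a caption embedding per candidate label.

    `text_embeddings` maps a caption (the "label") to its embedding.
    The temperature is an illustrative assumption, not a tuned value.
    """
    labels = list(text_embeddings)
    scores = [
        cosine_similarity(image_embedding, text_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(scores)))


# Toy embeddings (hypothetical values standing in for encoder outputs).
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a bird": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because the candidate labels are just captions, adding a new category only means adding a new caption embedding; no classifier head has to be retrained.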
Tag Archives: Image Recognition
Bird by Bird using Deep Learning
This article demonstrates how deep learning models used for image-related tasks can be advanced to address the fine-grained classification problem. We will walk through two parts. In the first, you will get familiar with some basic concepts of computer vision and convolutional neural networks; the second demonstrates how to apply this knowledge to the real-world problem of bird species classification using PyTorch. Specifically, you will learn how to build your own CNN model, ResNet-50, and how to further improve its performance using transfer learning, an auxiliary task, an attention-enhanced architecture, and a little more. Read More
Adaptive Discriminator Augmentation: GAN Training Breakthrough for Limited Data Applications
Neuroscientists find a way to make object-recognition models perform better
Computer vision models known as convolutional neural networks can be trained to recognize objects nearly as accurately as humans do. However, these models have one significant flaw: Very small changes to an image, which would be nearly imperceptible to a human viewer, can trick them into making egregious errors such as classifying a cat as a tree.
A team of neuroscientists from MIT, Harvard University, and IBM has developed a way to alleviate this vulnerability by adding to these models a new layer designed to mimic the earliest stage of the brain’s visual processing system. In a new study, they showed that this layer greatly improved the models’ robustness against this type of mistake. Read More
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency. Read More
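A bidirectional contrastive objective of the kind mentioned in this abstract can be sketched roughly as follows. This is not the authors' implementation: it is a generic two-direction, InfoNCE-style loss in plain Python, and the function names, the temperature of 0.1 and the 0.5 weighting between directions are assumptions made for illustration. Each image is pushed toward its own paired text and away from the other texts in the batch, and the same is done from text to image.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy(logits, target_index):
    """Negative log-softmax probability of the target entry (stable log-sum-exp)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[target_index]


def bidirectional_contrastive_loss(image_embs, text_embs, temperature=0.1, weight=0.5):
    """Weighted sum of image-to-text and text-to-image contrastive losses.

    Row i of `image_embs` is assumed to be paired with row i of `text_embs`.
    `temperature` and `weight` are illustrative assumptions.
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("expected one text embedding per image embedding")
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(u) for u in text_embs]
    n = len(images)
    # sim[i][j] = scaled cosine similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    image_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return weight * image_to_text + (1 - weight) * text_to_image


# Toy batch of two pairs (hypothetical 2-d embeddings).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_matching = bidirectional_contrastive_loss(image_embs, matching_texts)
loss_mismatched = bidirectional_contrastive_loss(image_embs, mismatched_texts)
print(loss_matching < loss_mismatched)
```

The loss is lower when each image is most similar to its own text, which is the signal an encoder would be trained on; no class labels are involved.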
Guide to Visual Recognition Datasets for Deep Learning with Python Code
Some visual recognition datasets have become benchmarks for supervised learning (Caltech101, Caltech256, CaltechBirds, CIFAR-10 and CIFAR-100) and for unsupervised or self-taught learning algorithms (STL10) using deep learning across different object categories in research and development. Visual recognition mainly covers image classification, image segmentation and localization, object detection and various other use cases. Many of these datasets are available through APIs in deep learning frameworks. This article covers the features of some of these datasets, along with Python code snippets showing how to use them. Read More
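As a small illustration of what working with one of these datasets involves, the sketch below unpacks a single image in the flat layout used by the Python version of the CIFAR-10 batches: each row holds 3072 values, with the 1024 red values first, then 1024 green, then 1024 blue, each plane stored row by row. The function name is hypothetical and the input here is synthetic; a framework's dataset API would normally do this unpacking for you.

```python
def cifar_row_to_image(row, size=32):
    """Convert one flat CIFAR-style row into a size x size grid of (R, G, B) tuples."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    image = []
    for y in range(size):
        pixels = []
        for x in range(size):
            offset = y * size + x
            # Red, green and blue planes are stored one after another.
            pixels.append((row[offset], row[plane + offset], row[2 * plane + offset]))
        image.append(pixels)
    return image


# Synthetic row: a uniformly orange image (not real CIFAR data).
row = [255] * 1024 + [128] * 1024 + [0] * 1024
image = cifar_row_to_image(row)
print(len(image), len(image[0]), image[0][0])
```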
Air Force bases look to facial recognition to secure entry
Two Air Force installations recently inked deals to use facial recognition technology to verify the identities of those coming on base — a move that can increase the physical distance during security checks as the coronavirus pandemic continues.
The Air Force awarded TrueFace phase two Small Business Innovation Research contracts to install its technology at Eglin Air Force Base and Joint Base McGuire-Dix-Lakehurst. The company calls its system “frictionless access control,” where security personnel do not need to be present for a check, adding that it can verify a face in one to two seconds. Read More
Gait-based Emotion Learning
Deep Learning with CIFAR-10
Image Classification using CNN
Neural networks are programmable patterns that help solve complex problems and produce the best achievable output. Deep learning, as we all know, is a step ahead of machine learning, and it helps train neural networks to answer previously unanswered questions or to improve existing solutions!
In this article, we will be implementing a Deep Learning Model using CIFAR-10 dataset. Read More
Detecting Deep-Fake Videos from Phoneme-Viseme Mismatches
Recent advances in machine learning and computer graphics have made it easier to convincingly manipulate video and audio. These so-called deep-fake videos range from complete full-face synthesis and replacement (face-swap), to complete mouth and audio synthesis and replacement (lip-sync), and partial word-based audio and mouth synthesis and replacement. Detection of deep fakes with only a small spatial and temporal manipulation is particularly challenging. We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape – visemes – are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sound M (mama), B (baba), or P (papa) in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations. We demonstrate the efficacy and robustness of this approach to detect different types of deep-fake videos, including in-the-wild deep fakes. Read More