Contrastive Learning of Medical Visual Representations from Paired Images and Text

Learning visual representations of medical images is core to medical image understanding, but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or on rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency. Read More
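The bidirectional contrastive objective described in the abstract can be sketched as an InfoNCE-style loss computed in both directions (image-to-text and text-to-image). The function below is a minimal illustration only; the temperature value and the equal weighting of the two directions are assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """InfoNCE loss in both image->text and text->image directions.

    image_emb, text_emb: (batch, dim) projected embeddings of paired
    images and report text; matching pairs share the same row index.
    The temperature of 0.1 is an illustrative choice.
    """
    # Cosine similarity between every image/text pair in the batch
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # The diagonal holds the true (paired) image-text matches
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)  # equal weighting is an assumption
```

In practice each encoder's output would pass through a learned projection head before this loss, and the pretrained image encoder is then transferred to the downstream classification tasks.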

#image-recognition

Guide to Visual Recognition Datasets for Deep Learning with Python Code

Some visual recognition datasets have set benchmarks for supervised learning (Caltech101, Caltech256, CaltechBirds, CIFAR-10 and CIFAR-100) and for unsupervised or self-taught learning algorithms (STL10) using deep learning across different object categories for a variety of research and development efforts. Visual recognition mainly comprises image classification, image segmentation and localization, object detection, and various other use-case problems. Many of these datasets have APIs available in some deep learning frameworks. This article discusses the features of some of these datasets, along with Python code snippets showing how to use them. Read More

#image-recognition, #python

Air Force bases look to facial recognition to secure entry

Two Air Force installations recently inked deals to use facial recognition technology to verify the identities of those coming on base — a move that can increase the physical distance during security checks as the coronavirus pandemic continues.

The Air Force awarded TrueFace phase two Small Business Innovation Research contracts to install its technology at Eglin Air Force Base and Joint Base McGuire-Dix-Lakehurst. The company calls its system “frictionless access control,” where security personnel do not need to be present for a check, adding that it can verify a face in one to two seconds. Read More

#dod, #image-recognition

Gait-based Emotion Learning

Read More

#robotics, #videos, #image-recognition

Deep Learning with CIFAR-10

Image Classification using CNN

Neural networks are programmable pattern-recognition models that help solve complex problems and produce the best achievable output. Deep learning, as we all know, is a step beyond machine learning, and it helps train neural networks to answer open questions or to improve existing solutions.

In this article, we will be implementing a deep learning model using the CIFAR-10 dataset. Read More
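A minimal sketch of the kind of CNN such an article typically builds for CIFAR-10 classification; the layer sizes here are illustrative choices, not the article's actual architecture:

```python
import torch
import torch.nn as nn

class SmallCIFARNet(nn.Module):
    """A minimal CNN for 32x32 RGB CIFAR-10 images (10 classes)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))  # logits over 10 classes
```

Training would pair this with a cross-entropy loss and an optimizer such as SGD or Adam over the CIFAR-10 DataLoader.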

#image-recognition, #python

Detecting Deep-Fake Videos from Phoneme-Viseme Mismatches

Recent advances in machine learning and computer graphics have made it easier to convincingly manipulate video and audio. These so-called deep-fake videos range from complete full-face synthesis and replacement (face-swap), to complete mouth and audio synthesis and replacement (lip-sync), and partial word-based audio and mouth synthesis and replacement. Detection of deep fakes with only a small spatial and temporal manipulation is particularly challenging. We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape – visemes – are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sounds M (mama), B (baba), or P (papa), in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations. We demonstrate the efficacy and robustness of this approach to detect different types of deep-fake videos, including in-the-wild deep fakes. Read More
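Once a per-frame mouth aperture and the timing of closure phonemes are available, the core check reduces to flagging frames where the mouth stays open during an M, B, or P sound. The sketch below assumes both inputs come from hypothetical upstream steps (a facial-landmark tracker and forced alignment of the audio); it is not the authors' implementation, and the threshold is an illustrative value:

```python
def find_viseme_mismatches(mouth_openness, closure_frames, open_threshold=0.2):
    """Flag frames where the mouth fails to close during M/B/P phonemes.

    mouth_openness: per-frame mouth aperture in [0, 1], e.g. from a
        facial-landmark tracker (hypothetical upstream step).
    closure_frames: frame indices aligned to M/B/P phonemes, e.g. from
        forced alignment of the audio transcript (also assumed given).
    Returns the subset of closure_frames where the mouth stayed open,
    i.e. candidate phoneme-viseme mismatches.
    """
    return [f for f in closure_frames if mouth_openness[f] > open_threshold]
```

A video with many such flagged frames would then be a candidate deep fake, with the per-video decision rule left to the full method.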

#fake, #image-recognition

Lidar used to cost $75,000—here’s how Apple brought it to the iPhone

How Apple made affordable lidar with no moving parts for the iPhone.

At Tuesday’s unveiling of the iPhone 12, Apple touted the capabilities of its new lidar sensor. Apple says lidar will enhance the iPhone’s camera by allowing more rapid focus, especially in low-light situations. And it may enable the creation of a new generation of sophisticated augmented reality apps. Read More

#big7, #image-recognition, #robotics

VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training.

We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score. Read More
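The Hungarian matching step for unordered tags can be illustrated with SciPy's `linear_sum_assignment`: each target tag is matched to the prediction slot where its negative log-likelihood is lowest, and the loss is summed over the matched pairs. This is a simplified sketch of the idea, not VIVO's actual training code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_tag_assignment(log_probs, target_tags):
    """Match unordered target tags to prediction slots by minimum cost.

    log_probs: (num_slots, vocab_size) array of log-probabilities over
        the tag vocabulary at each masked prediction slot.
    target_tags: list of tag-vocabulary indices (unordered).
    Returns (slot_indices, tag_indices, total negative log-likelihood).
    """
    # Cost of assigning tag t to slot s is its negative log-likelihood.
    cost = -log_probs[:, target_tags]            # (num_slots, num_tags)
    slot_idx, tag_idx = linear_sum_assignment(cost)  # optimal matching
    nll = cost[slot_idx, tag_idx].sum()          # loss over matched pairs
    return slot_idx, tag_idx, nll
```

Because the assignment is recomputed each step, the model is never penalized for predicting correct tags in a different order than they appear in the annotation.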

#image-recognition, #nlp, #big7

Inventing Virtual Meetings of Tomorrow with NVIDIA AI Research

NVIDIA Maxine is a fully accelerated platform SDK for developers of video conferencing services to build and deploy AI-powered features that use state-of-the-art models in their cloud. Video conferencing applications based on Maxine can reduce video bandwidth usage to one-tenth that of H.264 using AI video compression, dramatically reducing costs. Read More

#nvidia, #videos, #image-recognition

Toonify Yourself!

Upload a photo and see what you’d look like in an animated movie!

Read More

#image-recognition