Why CAPTCHA Pictures Are So Unbearably Depressing

I hate doing Google’s CAPTCHAs.

Part of it is the sheer hassle of repeatedly identifying objects — traffic lights, staircases, palm trees and buses — just so I can finish a web search. I also don’t like being forced to donate free labor to AI companies to help train their visual-recognition systems.

But a while ago, while numbly clicking on grainy images of fire hydrants, I was struck by another reason:

The images are deeply, overwhelmingly depressing. Read More

#image-recognition

NeRF-VAE: A Geometry Aware 3D Scene Generative Model

We propose NeRF-VAE, a 3D scene generative model that incorporates geometric structure via Neural Radiance Fields (NeRF) and differentiable volume rendering. In contrast to NeRF, our model takes into account shared structure across scenes and is able to infer the structure of a novel scene, without re-training, using amortized inference. NeRF-VAE's explicit 3D rendering process further contrasts with previous generative models, whose convolution-based rendering lacks geometric structure. Our model is a VAE that learns a distribution over radiance fields by conditioning them on a latent scene representation. We show that, once trained, NeRF-VAE is able to infer and render geometrically consistent scenes from previously unseen 3D environments using very few input images. We further demonstrate that NeRF-VAE generalizes well to out-of-distribution cameras, while convolutional models do not. Finally, we introduce and study an attention-based conditioning mechanism for NeRF-VAE's decoder, which improves model performance. Read More
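To make the conditioning idea concrete, here is a minimal PyTorch sketch of a radiance field conditioned on a per-scene latent code, so one set of weights can render many scenes. The single small MLP, names, and sizes are illustrative rather than the paper's architecture, and the amortized encoder that would produce the latent is omitted.

```python
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """Radiance field conditioned on a per-scene latent code z (illustrative)."""
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        # Input: 3D position (3) + view direction (3) + scene latent z.
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3) + density (1)
        )

    def forward(self, xyz, viewdir, z):
        # z: (1, latent_dim), e.g. sampled from the VAE's amortized posterior.
        z = z.expand(xyz.shape[0], -1)  # broadcast the scene latent to every query point
        out = self.mlp(torch.cat([xyz, viewdir, z], dim=-1))
        return torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])  # color, density

# Query 1024 points of a scene described by one 64-dim latent.
rgb, sigma = ConditionalNeRF()(torch.rand(1024, 3), torch.rand(1024, 3), torch.randn(1, 64))
```

Swapping in a different z switches scenes without retraining the weights, which is the property the abstract's amortized inference relies on.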

#image-recognition

Pre-trained deep learning imagery models update (July 2021)

The amount of imagery that’s collected and disseminated has increased by orders of magnitude over the past couple of years. Deep learning has been instrumental in efficiently extracting and deriving meaningful insights from these massive amounts of imagery. Last October, we released pre-trained geospatial deep learning models, making deep learning more approachable and accessible to a wide spectrum of users.

These models have been pre-trained by Esri on large volumes of data and can be used as-is, or further fine-tuned to your local geography, objects of interest, or type of imagery. You no longer need huge volumes of training data and imagery, massive compute resources, or the expertise to train such models yourself. With the pre-trained models, you can bring in the raw data or imagery and extract geographical features at the click of a button. Read More
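Esri's own tooling is the arcgis.learn Python API; the sketch below instead uses plain PyTorch/torchvision to illustrate the general workflow the post describes: start from a pre-trained backbone, freeze it, and fine-tune a small head on your own classes. The class count and batch are made up.

```python
import torch
import torchvision

# Start from a model pre-trained on large volumes of imagery.
weights = torchvision.models.ResNet50_Weights.DEFAULT
model = torchvision.models.resnet50(weights=weights)
for p in model.parameters():
    p.requires_grad = False  # keep the pre-trained backbone frozen
model.fc = torch.nn.Linear(model.fc.in_features, 5)  # e.g. 5 local land-cover classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# One illustrative training step on a fake batch of image chips.
images, labels = torch.rand(8, 3, 224, 224), torch.randint(0, 5, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```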

#image-recognition

Scientists adopt deep learning for multi-object tracking

Their novel framework achieves state-of-the-art performance without sacrificing efficiency in public surveillance tasks

Implementing algorithms that can simultaneously track multiple objects is essential to unlock many applications, from autonomous driving to advanced public surveillance. However, it is difficult for computers to discriminate between detected objects based on their appearance. Now, researchers at the Gwangju Institute of Science and Technology (GIST) have adapted deep learning techniques in a multi-object tracking framework, overcoming short-term occlusion and achieving remarkable performance without sacrificing computational speed. Read More

Read the Paper
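The article doesn't spell out the GIST framework's internals, but appearance-based multi-object trackers generally share one core step: matching existing tracks to new detections by the similarity of learned appearance embeddings. A generic sketch of that association step (not the paper's specific method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, threshold=0.5):
    """Match tracks to detections by appearance; unmatched detections
    would start new tracks, helping tracks survive short occlusions."""
    # Cosine similarity between every (track, detection) pair.
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = t @ d.T
    # Hungarian assignment on cost = 1 - similarity.
    rows, cols = linear_sum_assignment(1.0 - sim)
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= threshold]
```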

#image-recognition, #deep-learning

Google’s Supermodel: DeepMind Perceiver is a step on the road to an AI machine that could process anything and everything

The Perceiver is a kind of way-station on the road to what Google AI lead Jeff Dean has described as one model that could handle any task and “learn” faster, with less data.

Arguably one of the premier events to bring AI to popular attention in recent years was the invention of the Transformer by Ashish Vaswani and colleagues at Google in 2017. The Transformer led to a wave of language programs, such as Google’s BERT and OpenAI’s GPT-3, that can produce surprisingly human-seeming sentences, giving the impression that machines can write like a person.

Now, scientists at DeepMind in the U.K., which is owned by Google, want to take the benefits of the Transformer beyond text, to let it revolutionize other material including images, sounds and video, and spatial data of the kind a car records with LiDAR. 

The Perceiver, unveiled this week by DeepMind in a paper posted on arXiv, adapts the Transformer with some tweaks to let it consume all those types of input and to perform the various tasks, such as image recognition, for which separate kinds of neural networks are usually developed. Read More
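The Perceiver's central tweak is a small, fixed-size latent array that cross-attends to the raw input array, so attention cost no longer scales quadratically with input size and the same machinery can ingest pixels, audio samples, or point clouds. A minimal sketch (dimensions, head counts, and the single block are illustrative):

```python
import torch
import torch.nn as nn

class PerceiverBlock(nn.Module):
    def __init__(self, num_latents=64, dim=128):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, inputs):
        # inputs: (batch, n_tokens, dim) -- flattened pixels, audio, LiDAR points...
        q = self.latents.expand(inputs.shape[0], -1, -1)
        x, _ = self.cross_attn(q, inputs, inputs)  # latents read from the raw inputs
        x, _ = self.self_attn(x, x, x)             # ordinary Transformer processing
        return x                                   # (batch, num_latents, dim)

# 10,000 input tokens are distilled into 64 latents, whatever the modality.
out = PerceiverBlock()(torch.rand(2, 10_000, 128))
```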

#big7, #nlp, #image-recognition

Alien Dreams: An Emerging Art Scene

In recent months there has been a bit of an explosion in the AI generated art scene.

Ever since OpenAI released the weights and code for their CLIP model, various hackers, artists, researchers, and deep learning enthusiasts have figured out how to use CLIP as an effective “natural language steering wheel” for various generative models, allowing artists to create all sorts of interesting visual art merely by inputting some text (a caption, a poem, a lyric, a word) to one of these models.

For instance, inputting “a cityscape at night” produces this cool, abstract-looking depiction of some city lights. Read More
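The “steering wheel” usually means gradient ascent on CLIP's image-text similarity score. A minimal sketch using OpenAI's released clip package; for simplicity it optimizes raw pixels directly (skipping CLIP's usual input normalization), whereas the artists described above would optimize the latent of a generative model such as VQGAN or BigGAN:

```python
import torch
import clip  # OpenAI's released model: github.com/openai/CLIP

# Load on CPU: clip.load keeps weights in fp32 there, which keeps this sketch simple.
model, _ = clip.load("ViT-B/32", device="cpu")
text_emb = model.encode_text(clip.tokenize(["a cityscape at night"])).detach()

# Stand-in for a generator: optimize an image tensor directly.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    image_emb = model.encode_image(image)
    loss = -torch.cosine_similarity(image_emb, text_emb).mean()  # steer toward the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```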

#image-recognition, #nlp, #gans

Zero-Shot Detection via Vision and Language Knowledge Distillation

Zero-shot image classification has made promising progress by training aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP [33]) into a two-stage detector (e.g., Mask R-CNN [17]). Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model. We use the text embeddings as the detection classifier, obtained by feeding category names into the pre-trained text encoder. We then minimize the distance between the region embeddings and image embeddings, obtained by feeding region proposals into the pre-trained image encoder. During inference, we include text embeddings of novel categories in the detection classifier for zero-shot detection. We benchmark performance on the LVIS dataset [15] by holding out all rare categories as novel categories. ViLD obtains 16.1 mask APr with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can transfer directly to other datasets, achieving 72.2 AP50, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. Read More
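The two objectives in the abstract reduce to (1) a classification loss whose "classifier" is the matrix of category-name text embeddings and (2) a distillation loss pulling region embeddings toward the pre-trained image encoder's embeddings of the cropped proposals. A sketch with illustrative names and shapes:

```python
import torch
import torch.nn.functional as F

def vild_losses(region_embs, text_embs, labels, clip_image_embs, temperature=0.01):
    """region_embs: (R, D) from the detector head; text_embs: (C, D) CLIP text
    embeddings of category names; labels: (R,); clip_image_embs: (R, D) CLIP
    image embeddings of the cropped region proposals. Shapes are illustrative."""
    region = F.normalize(region_embs, dim=-1)
    # Text embeddings act as the detection classifier: logits are scaled
    # cosine similarities to each category-name embedding.
    logits = region @ F.normalize(text_embs, dim=-1).T / temperature
    cls_loss = F.cross_entropy(logits, labels)
    # Distillation: align region embeddings with the image encoder's output.
    distill_loss = F.l1_loss(region, F.normalize(clip_image_embs, dim=-1))
    return cls_loss, distill_loss
```

At inference, rows for novel category names are simply appended to text_embs, which is what makes the detector zero-shot.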

#image-recognition, #nlp, #gans

VideoGPT: Video Generation using VQ-VAE and Transformers

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural images from UCF-101 and the Tumblr GIF dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Read More
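The first stage hinges on vector quantization: encoder features are snapped to their nearest codebook entry, and the resulting discrete indices are what the GPT-like prior models in stage two. A miniature sketch of that quantization step (the real model quantizes 3D-convolutional video features; the encoder and the prior are omitted here):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, tokens, dim) continuous encoder outputs.
        dists = torch.cdist(z, self.codebook.weight[None].expand(z.shape[0], -1, -1))
        codes = dists.argmin(-1)      # discrete indices, one per latent token
        quantized = self.codebook(codes)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (quantized - z).detach(), codes  # codes feed the GPT prior
```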

#gans, #image-recognition

NVIDIA’s Canvas app turns doodles into AI-generated ‘photos’

NVIDIA has launched a new app you can use to paint life-like landscape images — even if you have zero artistic skills and a first grader can draw better than you. The new application is called Canvas, and it can turn childlike doodles and sketches into photorealistic landscape images in real time. It’s now available for download as a free beta, though you can only use it if your machine is equipped with an NVIDIA RTX GPU.

Canvas is powered by the GauGAN AI painting tool, which NVIDIA Research developed and trained using 5 million images. Read More
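GauGAN's generator is built around spatially-adaptive normalization (SPADE), in which the segmentation-map doodle predicts a per-pixel scale and shift for the generator's features at every layer, so brush strokes control what appears where. A condensed sketch (channel sizes and the normalization choice are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, channels, label_channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(label_channels, channels, 3, padding=1)
        self.beta = nn.Conv2d(label_channels, channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the doodle to this layer's resolution, then modulate
        # the normalized features with a map-dependent scale and shift.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        return self.norm(x) * (1 + self.gamma(segmap)) + self.beta(segmap)
```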

#gans, #image-recognition

Rembrandt’s The Night Watch painting restored by AI

The missing edges of Rembrandt’s painting The Night Watch have been restored using artificial intelligence.

The canvas, created in 1642, was trimmed in 1715 to fit between two doors at Amsterdam’s city hall.

Since then, 60cm (2ft) from the left, 22cm from the top, 12cm from the bottom and 7cm from the right have been missing.

But computer software has now restored the full painting for the first time in 300 years. Read More

#image-recognition