BARF: Bundle-Adjusting Neural Radiance Fields

Neural Radiance Fields (NeRF) [30] have recently gained a surge of interest within the computer vision community for their power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses, i.e., the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naïvely applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction. Read More
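The abstract does not spell out how coarse-to-fine registration interacts with positional encoding, so the following is only a rough sketch of one plausible approach: a smooth mask that switches on higher-frequency bands as training progresses. The function names, the cosine ramp, and the `alpha` progress parameter are illustrative assumptions, not taken from the text above.

```python
import math


def frequency_weight(alpha: float, k: int) -> float:
    """Weight in [0, 1] for frequency band k at training progress alpha.

    Assumption: band k is off while alpha < k, ramps up smoothly
    for k <= alpha < k + 1, and is fully on afterwards.
    """
    t = alpha - k
    if t < 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return (1.0 - math.cos(t * math.pi)) / 2.0


def masked_positional_encoding(x: float, num_bands: int, alpha: float) -> list:
    """Encode a scalar coordinate with sin/cos bands scaled by their weights."""
    features = [x]  # the raw coordinate is always passed through
    for k in range(num_bands):
        weight = frequency_weight(alpha, k)
        frequency = (2.0 ** k) * math.pi
        features.append(weight * math.sin(frequency * x))
        features.append(weight * math.cos(frequency * x))
    return features


# Early in training (alpha = 0) only the raw coordinate carries signal;
# late in training (alpha = num_bands) every band is fully enabled.
coarse = masked_positional_encoding(0.3, num_bands=4, alpha=0.0)
fine = masked_positional_encoding(0.3, num_bands=4, alpha=4.0)
```

With a mask like this, the early optimization sees a smooth, low-frequency signal, which is the usual motivation for coarse-to-fine alignment; the high-frequency bands only contribute once the poses are presumably closer to correct.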

#image-recognition

Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. Read More
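To make the zero-shot idea concrete, here is a minimal sketch of classifying an image by comparing its embedding with the embeddings of candidate captions. It assumes the image and text encoders already exist and have produced vectors in a shared space; the toy three-dimensional embeddings, the function names, and the `scale` value are made up for illustration and are not taken from the paper.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, caption_embeddings, scale=100.0):
    """Return a probability per caption from scaled cosine similarities.

    `scale` is an arbitrary sharpening factor chosen for this sketch.
    """
    image = normalize(image_embedding)
    captions = list(caption_embeddings)
    logits = []
    for caption in captions:
        text = normalize(caption_embeddings[caption])
        logits.append(scale * sum(a * b for a, b in zip(image, text)))
    return dict(zip(captions, softmax(logits)))


# Hypothetical embeddings standing in for real encoder outputs.
toy_captions = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
toy_image = [0.8, 0.2, 0.1]

probabilities = zero_shot_classify(toy_image, toy_captions)
best_caption = max(probabilities, key=probabilities.get)
```

New classes can be added by writing new captions and embedding them, which is why no dataset-specific training is needed in this setup.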

#image-recognition, #nlp

Towards General Purpose Vision Systems

A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system’s ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting. Read More

#image-recognition

Projected Distribution Loss for Image Enhancement

Features obtained from object recognition CNNs have been widely used for measuring perceptual similarities between images. Such differentiable metrics can be used as perceptual learning losses to train image enhancement models. However, the choice of the distance function between input and target features may have a consequential impact on the performance of the trained model. While using the norm of the difference between extracted features leads to limited hallucination of details, measuring the distance between distributions of features may generate more textures; yet also more unrealistic details and artifacts. In this paper, we demonstrate that aggregating 1D-Wasserstein distances between CNN activations is more reliable than the existing approaches, and it can significantly improve the perceptual performance of enhancement models. More explicitly, we show that in imaging applications such as denoising, super-resolution, demosaicing, deblurring and JPEG artifact removal, the proposed learning loss outperforms the current state-of-the-art on reference-based perceptual losses. This means that the proposed learning loss can be plugged into different imaging frameworks and produce perceptually realistic results. Read More
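As a rough illustration of the central quantity, the sketch below computes a 1D Wasserstein distance between two equally sized sets of activations and averages it over channels. The abstract does not say how the distances are aggregated or how features are extracted, so the per-channel mean and the plain nested-list feature layout are assumptions made for this example.

```python
def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between two equal-size empirical samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("samples must have the same number of elements")
    if not a:
        raise ValueError("samples must be non-empty")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def projected_distribution_distance(features_a, features_b):
    """Average the per-channel 1D Wasserstein distances.

    Each argument is a list of channels; each channel is a flat list of
    activations. Averaging over channels is an assumption of this sketch.
    """
    if len(features_a) != len(features_b):
        raise ValueError("feature maps must have the same number of channels")
    distances = [wasserstein_1d(ca, cb) for ca, cb in zip(features_a, features_b)]
    return sum(distances) / len(distances)


# Two toy "feature maps" with two channels of three activations each.
output_features = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
target_features = [[3.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
loss_value = projected_distribution_distance(output_features, target_features)
```

Because the values are sorted before comparison, the first channel contributes zero even though its activations sit at different positions: the distance compares distributions of activations rather than activations at matching locations.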

#image-recognition

Google’s Cinematic Moments is like Apple’s Live Photos — but a lot creepier

Google Photos will see several new features later this year — but some could be a bit creepier than others.

Arriving this summer, Cinematic Moments is a Google Photos tool that yields similar results to Apple’s Live Photos. The difference is that Cinematic Moments uses artificial intelligence (AI) to fill in the gaps between a few photos, rather than recording a short video, to produce in-motion media.

With a handful of still images, Cinematic Moments can create a complete and animated action shot. It uses neural networks to synthesize the moment, materializing frames from thin air, practically. Read More

#big7, #image-recognition

Intel Researchers Give ‘GTA V’ Photorealistic Graphics, Similar Techniques Could Do the Same for VR

Read More

#image-recognition, #videos

Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multi-modal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs that crucially demonstrates CxC’s value for measuring the influence of intra- and inter-modality learning. Read More

#image-recognition, #nlp

A company is using artificial intelligence to insert new products and ads into content, including old movies

Mirriad uses AI technology to insert products and ads into new and old content, posing a threat to traditional advertising. The technology identifies places within content where ads or products could be inserted, as well as where the viewer’s attention drifts within any given still. The company provides its in-content solution to the Chinese giant Tencent and plans to work with streaming platforms, where it could give advertisers a way to make revenue amid a widespread shift to streaming. Read More

#image-recognition

First ship controlled by artificial intelligence prepares for maiden voyage

The “Mayflower 400”, the world’s first intelligent ship, bobs gently in a light swell as it stops its engines in Plymouth Sound, off England’s southwest coast, before self-activating a hydrophone designed to listen to whales.

The 50-foot (15-metre) trimaran, which weighs nine tonnes and navigates with complete autonomy, is preparing for a transatlantic voyage. Read More

#image-recognition

Fingerspelling

Online game to learn sign language. Fingerspelling.xyz combines advanced hand recognition technology with machine learning to teach sign language. Read More

#image-recognition, #nlp