How Apple made affordable lidar with no moving parts for the iPhone.
At Tuesday’s unveiling of the iPhone 12, Apple touted the capabilities of its new lidar sensor. Apple says lidar will enhance the iPhone’s camera by allowing more rapid focus, especially in low-light situations. And it may enable the creation of a new generation of sophisticated augmented reality apps. Read More
Tag Archives: Image Recognition
VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training
It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training.
We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score. Read More
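The matching idea mentioned in the abstract can be illustrated with a small sketch. Because image tags are unordered, the loss for a set of masked positions should not depend on which position predicts which tag, so a matching loss scores the best one-to-one assignment between predictions and ground-truth tags. The code below is an illustration under stated assumptions, not the paper's implementation: the function name `matching_loss`, the toy probabilities, and the use of negative log-likelihood as the pairwise cost are all hypothetical. For brevity it brute-forces the assignment with `itertools.permutations` rather than running the Hungarian algorithm, which finds the same minimum-cost assignment in polynomial time.

```python
import itertools
import math


def matching_loss(pred_probs, target_tags):
    """Minimum total negative log-likelihood over one-to-one assignments.

    pred_probs:  list of dicts, one per masked position, mapping tag -> probability
                 (hypothetical model output).
    target_tags: list of ground-truth tags in arbitrary order, same length.
    Returns (loss, assignment), where assignment[i] is the tag matched to position i.
    """
    if len(pred_probs) != len(target_tags):
        raise ValueError("need exactly one target tag per masked position")

    eps = 1e-12  # avoids log(0) when a tag has no predicted probability
    best_loss, best_perm = math.inf, None

    # Brute-force stand-in for the Hungarian algorithm: try every ordering
    # of the target tags and keep the cheapest one.
    for perm in itertools.permutations(target_tags):
        loss = sum(
            -math.log(max(probs.get(tag, 0.0), eps))
            for probs, tag in zip(pred_probs, perm)
        )
        if loss < best_loss:
            best_loss, best_perm = loss, perm

    return best_loss, list(best_perm)


# Toy example: two masked positions, tags supplied in the "wrong" order.
preds = [
    {"dog": 0.7, "frisbee": 0.2, "grass": 0.1},
    {"dog": 0.1, "frisbee": 0.8, "grass": 0.1},
]
loss, assignment = matching_loss(preds, ["frisbee", "dog"])
print(assignment, round(loss, 4))
```

The loss is the same whichever order the target tags are listed in, which is the property an unordered tag set needs. The brute-force search is only practical for a handful of masked tags; a real implementation would use a proper assignment solver.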
Inventing Virtual Meetings of Tomorrow with NVIDIA AI Research
NVIDIA Maxine is a fully accelerated platform SDK for developers of video conferencing services to build and deploy AI-powered features that use state-of-the-art models in their cloud. Video conferencing applications based on Maxine can reduce video bandwidth usage down to one-tenth of H.264 using AI video compression, dramatically reducing costs. Read More
Toonify Yourself!
Upload a photo and see what you’d look like in an animated movie!

AI ‘resurrects’ 54 Roman emperors, in stunningly lifelike images
Computational Needs for Computer Vision (CV) in AI and ML Systems
Computer vision (CV) is a major task for modern Artificial Intelligence (AI) and Machine Learning (ML) systems. It’s accelerating nearly every domain in the tech industry, enabling organizations to revolutionize the way machines and business systems work.
… In this article, we briefly show you the common challenges associated with a CV system when it employs modern ML algorithms. Read More
YouTube’s Plot to Silence Conspiracy Theories
From flat-earthers to QAnon to Covid quackery, the video giant is awash in misinformation. Can AI keep the lunatic fringe from going viral?
Mark Sargent saw instantly that his situation had changed for the worse. A voluble, white-haired 52-year-old, Sargent is a flat-earth evangelist who lives on Whidbey Island in Washington state and drives a Chrysler with the vanity plate “ITSFLAT.” But he’s well known around the globe, at least among those who don’t believe they are living on one. That’s thanks to YouTube, which was the on-ramp both to his flat-earth ideas and to his subsequent international stardom.
… Crucial to his success, he says, was YouTube’s recommendation system. …For four years, Sargent’s flat-earth videos got a steady stream of traffic from YouTube’s algorithms. Then, in January 2019, the flow of new viewers suddenly slowed to a trickle. Read More
Library of Congress Launches New Tool to Search Historical Newspaper Images
The public can now explore more than 1.5 million historical newspaper images online and free of charge. The latest machine learning experience from Library of Congress Labs, Newspaper Navigator allows users to search visual content in American newspapers dating 1789-1963.
… Through the creative ingenuity of Innovator in Residence Benjamin Lee and advances in machine learning, Newspaper Navigator now makes images in the newspapers searchable by enabling users to search by visual similarity. Read More
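Search by visual similarity is commonly built on comparing embedding vectors, and the sketch below shows that general idea only. It is not a description of how Newspaper Navigator works: the excerpt does not say how similarity is computed, so the cosine-similarity ranking, the function names, the image identifiers, and the three-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=2):
    """Return the names of the k indexed images whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; a real system would get these from an image model.
index = {
    "map_1862": [0.9, 0.1, 0.0],
    "portrait_1901": [0.1, 0.9, 0.2],
    "cartoon_1920": [0.0, 0.2, 0.9],
}
results = most_similar([0.8, 0.2, 0.1], index, k=2)
print(results)
```

A linear scan like this is fine for a toy index. At the scale of more than a million images, an approximate nearest-neighbor index would normally replace the loop.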
DeepFaceDrawing: Deep Generation of Face Images from Sketches
Recent deep image-to-image translation techniques allow fast generation of face images from freehand sketches. However, existing solutions tend to overfit to sketches, thus requiring professional sketches or even edge maps as input. To address this issue, our key idea is to implicitly model the shape space of plausible face images and synthesize a face image in this space to approximate an input sketch. We take a local-to-global approach. We first learn feature embeddings of key face components, and push corresponding parts of input sketches towards underlying component manifolds defined by the feature vectors of face component samples. We also propose another deep neural network to learn the mapping from the embedded component features to realistic images with multi-channel feature maps as intermediate results to improve the information flow. Our method essentially uses input sketches as soft constraints and is thus able to produce high-quality face images even from rough and/or incomplete sketches. Our tool is easy to use even for non-artists, while still supporting fine-grained control of shape details. Both qualitative and quantitative evaluations show the superior generation ability of our system to existing and alternative solutions. The usability and expressiveness of our system are confirmed by a user study. Read More
Full-Body Awareness from Partial Observations
There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately, current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on and on three other datasets without further adaptation. Read More
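For readers unfamiliar with PCK (Percentage of Correct Keypoints), the metric named in the abstract, here is a minimal sketch of how such a score is typically computed: a predicted keypoint counts as correct when it lies within a threshold distance of its annotated location. The threshold convention used here (a fraction `alpha` of a reference size such as a bounding-box dimension), the function name `pck`, and the toy coordinates are assumptions. The excerpt does not state the paper's exact protocol.

```python
import math


def pck(pred, gt, ref_size, alpha=0.5):
    """Fraction of predicted 2D keypoints within alpha * ref_size of the ground truth.

    pred, gt: equal-length lists of (x, y) tuples in pixel coordinates.
    ref_size: reference length used to scale the threshold (assumed convention).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of keypoints")
    if not gt:
        return 0.0

    threshold = alpha * ref_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Toy example: the last keypoint has a negative y coordinate, i.e. it lies
# outside the image frame, which the metric handles like any other point.
gt = [(10.0, 10.0), (20.0, 10.0), (15.0, 30.0), (15.0, -5.0)]
pred = [(11.0, 10.0), (20.0, 14.0), (25.0, 30.0), (15.0, -6.0)]
score = pck(pred, gt, ref_size=10.0, alpha=0.5)
print(score)  # 0.75: three of the four keypoints fall within 5 pixels
```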