In cooperation with IARPA, National Institute of Standards and Technology (NIST) is currently running three challenges related to processing of unconstrained in-the-wild face images. The Face Recognition Vendor Test (FRVT) is an ongoing evaluation of face recognition algorithms applied to large image databases sequestered at NIST. Algorithms may be submitted to NIST at any time, and results will be posted when ready, usually within two weeks. Homepage
Tag Archives: Image Recognition
Unsupervised Learning of Visual 3D Keypoints for Control
Learning sensorimotor control policies from high dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent space such as visual keypoints often outperforms unstructured representations for robotic control. However, most of these representations, whether structured or unstructured are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and downstream task objective. These discovered D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space. The proposed approach outperforms prior state-of-art methods across a variety of reinforcement learning benchmarks. Read More
Generating Animations From Audio With NVIDIA’s Deep Learning Tech
Check out a tool in beta called Omniverse Audio2Face that lets you quickly generate new animations.
In case you missed the news, NVIDIA has a tool in beta that lets you quickly and easily generate expressive facial animation from just an audio source using the team’s deep learning-based technology. The Audio2Face tool allows users to simplify the animation of 3D characters for a game, film, real-time digital assistants, and other projects. The toolkit lets you run the results live or bake them out. Read More
Designing effective traditional and deep learning-based inspection systems for machine vision applications
When best practices are followed, machine vision and deep learning-based imaging systems are capable of effective visual inspection and will improve efficiency, increase throughput, and drive revenue.
For decades, machine vision technology has performed automated inspection tasks—including defect detection, flaw analysis, assembly verification, sorting, and counting—in industrial settings. Recent computer vision software advances and processing techniques have further enhanced the capabilities of these imaging systems in new and expanding uses. The imaging system itself remains a critically important vision component, yet its role and execution can be underestimated or misunderstood.
Without a well-designed and properly installed imaging system, software will struggle to reliably detect defects. For example, even though the imaging setup in Figure 1 (left) displays an attractive image of a gear, only the image on the right clearly shows a dent. When best practices are followed, machine vision and deep learning-based imaging systems are capable of effective visual inspection and will improve efficiency, increase throughput, and drive revenue. This article takes an in-depth dive into the best practices for iterative design and provides a roadmap for success for designing each type of system. Read More
Google AI Introduces ‘WIT’, A Wikipedia-Based Image Text Dataset For Multimodal Multilingual Machine Learning
Image and text datasets are widely used in many machine learning applications. To model the relationship between images and text, most multimodal Visio-linguistic models today rely on large datasets. Historically, these datasets were created by either manually captioning images or crawling the web and extracting the alt-text as the caption. While the former method produces higher-quality data, the intensive manual annotation process limits the amount of data produced. The automated extraction method can result in larger datasets. However, it requires either heuristics and careful filtering to ensure data quality or scaling-up models to achieve robust performance.
To overcome these limitations, Google research team created a high-quality, large-sized, multilingual dataset called the Wikipedia-Based Image Text (WIT) Dataset. It is created by extracting multiple text selections associated with an image from Wikipedia articles and Wikimedia image links. Read More
GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)
We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse – a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video to video translation without ever training on videos Read More
#gans, #image-recognitionToshiba Claims To Have Created World’s Most Accurate Visual Question Answering AI
Toshiba Corporation claims to have developed the world’s most accurate and highly versatile Visual Question Answering (VQA) AI that can recognise not only people and objects but also colours, shapes, appearances and background details in images.
The AI overcomes the difficulty of answering questions on the positioning and appearance of people and objects and possesses the ability to learn the information required to handle a wide range of questions and answers.
Toshiba presented the technology at ICANN2021, the international conference for neural networks, on 14 September. Read More
A way to spot computer-generated faces
A small team of researchers from The State University of New York at Albany, the State University of New York at Buffalo and Keya Medical has found a common flaw in computer-generated faces by which they can be identified. The group has written a paper describing their findings and have uploaded them to the arXiv preprint server.
…The researchers note that in many cases, users can simply zoom in on the eyes of a person they suspect may not be real to spot the pupil irregularities. They also note that it would not be difficult to write software to spot such errors and for social media sites to use it to remove such content. Unfortunately, they also note that now that such irregularities have been identified, the people creating the fake pictures can simply add a feature to ensure the roundness of pupils. Read More
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation. In this paper, we introduce VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually. VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks. In addition, we show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements. Read More
#image-recognition, #self-supervisedESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
The Super-Resolution Generative Adversarial Network (SRGAN) [1] is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN [2] to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge [3]. Read More
#gans, #image-recognition