Autonomous driving is considered to be the holy grail of the automotive industry and has been promised to us for quite a long time already. If I recall the slides from a 2013 Bosch presentation correctly, we should all have been passengers in our own cars a year ago. Back then, seven years seemed like a reasonable time frame but, health crisis aside, we are nowhere near fully autonomous driving, or Level 5 (L5) autonomy as the industry calls it.
Sure, Tesla calls its assistance suite “Autopilot” or even “Full Self-Driving,” but that is just a deceptive trade name for a system that is only capable of L2 autonomy. This means that the car cannot be trusted with your life, and Tesla does not assume responsibility for whatever mischief the car might get up to. Read More
Tag Archives: Image Recognition
Synthesia raises $50M to leverage synthetic avatars for corporate training and more
Because every doc should be a presentation, and every presentation should be a video?
Synthesia, a startup using AI to create synthetic videos, is walking a fine, but thus far prosperous, line between being creepy and being pretty freakin’ cool.
…Synthesia allows anyone to turn text or a slide deck presentation into a video, complete with a talking avatar. Customers can leverage existing avatars, created from the performance of actors, or create their own in minutes by uploading some video. Users also can upload a recording of their voice, which can be transformed to say just about anything under the sun. Read More
Artificial intelligence that understands object relationships
A new machine-learning model could enable robots to understand interactions in the world in the way humans do.
MIT researchers have developed a machine learning model that understands the underlying relationships between objects in a scene and can generate accurate images of scenes from text descriptions. Read More
‘Paint Me a Picture’: NVIDIA Research Shows GauGAN AI Art Demo Now Responds to Words
GauGAN2 uses a deep learning model that turns a simple written phrase, or sentence, into a photorealistic masterpiece.
A picture worth a thousand words now takes just three or four words to create, thanks to GauGAN2, the latest version of NVIDIA Research’s wildly popular AI painting demo.
The deep learning model behind GauGAN allows anyone to channel their imagination into photorealistic masterpieces — and it’s easier than ever. Simply type a phrase like “sunset at a beach” and AI generates the scene in real time. Add an additional adjective like “sunset at a rocky beach,” or swap “sunset” to “afternoon” or “rainy day” and the model, based on generative adversarial networks, instantly modifies the picture.
With the press of a button, users can generate a segmentation map, a high-level outline that shows the location of objects in the scene. From there, they can switch to drawing, tweaking the scene with rough sketches using labels like sky, tree, rock and river, allowing the smart paintbrush to incorporate these doodles into stunning images. Read More
Face Recognition Vendor Test (FRVT) Ongoing
In cooperation with IARPA, the National Institute of Standards and Technology (NIST) is currently running three challenges related to the processing of unconstrained in-the-wild face images. The Face Recognition Vendor Test (FRVT) is an ongoing evaluation of face recognition algorithms applied to large image databases sequestered at NIST. Algorithms may be submitted to NIST at any time, and results will be posted when ready, usually within two weeks. Homepage
Unsupervised Learning of Visual 3D Keypoints for Control
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent spaces such as visual keypoints often outperform unstructured representations for robotic control. However, most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and a downstream task objective. These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space. The proposed approach outperforms prior state-of-the-art methods across a variety of reinforcement learning benchmarks. Read More
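The excerpt does not spell out how the multi-view consistency loss is defined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each camera's pose (rotation and translation) is known, maps every view's camera-frame keypoints into a shared world frame, and penalizes how far each view's estimate sits from the cross-view mean. All function names here (`to_world`, `to_camera`, `rot_z`, `multiview_consistency_loss`) are hypothetical.

```python
import math


def to_world(point_cam, rotation, translation):
    """Map one camera-frame 3D point to the world frame: x_w = R x_c + t."""
    return [
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def to_camera(point_world, rotation, translation):
    """Inverse mapping: x_c = R^T (x_w - t)."""
    d = [point_world[i] - translation[i] for i in range(3)]
    return [sum(rotation[j][i] * d[j] for j in range(3)) for i in range(3)]


def rot_z(theta):
    """3x3 rotation about the z axis (used here only to build toy camera poses)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def multiview_consistency_loss(keypoints_per_view, rotations, translations):
    """Mean squared distance of each view's world-frame keypoints
    from the cross-view mean of the same keypoint (assumed definition)."""
    world = [
        [to_world(p, R, t) for p in kps]
        for kps, R, t in zip(keypoints_per_view, rotations, translations)
    ]
    n_views, n_kp = len(world), len(world[0])
    total = 0.0
    for k in range(n_kp):
        mean = [
            sum(world[v][k][i] for v in range(n_views)) / n_views
            for i in range(3)
        ]
        total += sum(
            (world[v][k][i] - mean[i]) ** 2
            for v in range(n_views)
            for i in range(3)
        )
    return total / (n_views * n_kp)


# Toy setup: two keypoints seen by two cameras with known (assumed) poses.
world_pts = [[0.1, 0.2, 0.3], [-0.4, 0.0, 0.5]]
rotations = [rot_z(0.0), rot_z(math.pi / 2)]
translations = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
views = [
    [to_camera(p, R, t) for p in world_pts]
    for R, t in zip(rotations, translations)
]

consistent = multiview_consistency_loss(views, rotations, translations)

# Perturb one keypoint in the second view; the views now disagree.
views[1][0][2] += 0.2
inconsistent = multiview_consistency_loss(views, rotations, translations)
```

When every view agrees on where a keypoint lies in the world, the loss is zero up to floating-point error; any disagreement between views raises it. In a learned system the keypoints would come from the encoder and the loss would be minimized by gradient descent alongside the task objective, which this plain-Python sketch does not attempt.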
Generating Animations From Audio With NVIDIA’s Deep Learning Tech
Check out a tool in beta called Omniverse Audio2Face that lets you quickly generate new animations.
In case you missed the news, NVIDIA has a tool in beta that lets you quickly and easily generate expressive facial animation from just an audio source using the team’s deep learning-based technology. The Audio2Face tool allows users to simplify the animation of 3D characters for a game, film, real-time digital assistants, and other projects. The toolkit lets you run the results live or bake them out. Read More
Designing effective traditional and deep learning-based inspection systems for machine vision applications
When best practices are followed, machine vision and deep learning-based imaging systems are capable of effective visual inspection and will improve efficiency, increase throughput, and drive revenue.
For decades, machine vision technology has performed automated inspection tasks—including defect detection, flaw analysis, assembly verification, sorting, and counting—in industrial settings. Recent computer vision software advances and processing techniques have further enhanced the capabilities of these imaging systems in new and expanding uses. The imaging system itself remains a critically important vision component, yet its role and execution can be underestimated or misunderstood.
Without a well-designed and properly installed imaging system, software will struggle to reliably detect defects. For example, even though the imaging setup in Figure 1 (left) displays an attractive image of a gear, only the image on the right clearly shows a dent. When best practices are followed, machine vision and deep learning-based imaging systems are capable of effective visual inspection and will improve efficiency, increase throughput, and drive revenue. This article takes an in-depth dive into the best practices for iterative design and provides a roadmap for success for designing each type of system. Read More
Google AI Introduces ‘WIT’, A Wikipedia-Based Image Text Dataset For Multimodal Multilingual Machine Learning
Image and text datasets are widely used in many machine learning applications. To model the relationship between images and text, most multimodal visio-linguistic models today rely on large datasets. Historically, these datasets were created by either manually captioning images or crawling the web and extracting the alt-text as the caption. While the former method produces higher-quality data, the intensive manual annotation process limits the amount of data produced. The automated extraction method can result in larger datasets. However, it requires either heuristics and careful filtering to ensure data quality or scaled-up models to achieve robust performance.
To overcome these limitations, a Google research team created a high-quality, large, multilingual dataset called the Wikipedia-Based Image Text (WIT) Dataset. It was created by extracting multiple text selections associated with an image from Wikipedia articles and Wikimedia image links. Read More
GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)
We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse – a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea that the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video-to-video translation without ever training on videos. Read More
#gans, #image-recognition