Can AI Save the Internet from Fake News?

There’s an old proverb that says “seeing is believing.” But in the age of artificial intelligence, it’s becoming increasingly difficult to take anything at face value—literally.

The rise of so-called “deepfakes,” in which different types of AI-based techniques are used to manipulate video content, has reached the point where Congress held its first hearing last month on the potential abuses of the technology. The congressional investigation coincided with the release of a doctored video of Facebook CEO Mark Zuckerberg delivering what appeared to be a sinister speech. Read More

#fake, #image-recognition, #nlp

IBM’s AI automatically generates creative captions for images

Writing photo captions is a monotonous — but necessary — chore begrudgingly undertaken by editors everywhere. Fortunately for them, though, AI might soon be able to handle the bulk of the work. In a paper (“Adversarial Semantic Alignment for Improved Image Captions”) appearing at the 2019 Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California this week, a team of scientists at IBM Research describes a model capable of autonomously crafting diverse, creative, and convincingly humanlike captions. Read More

#gans, #image-recognition, #nlp

Adversarial Semantic Alignment for Improved Image Captions

In this paper, we study image captioning as conditional GAN training, proposing both a context-aware LSTM captioner and a co-attentive discriminator that enforces semantic alignment between images and captions. We empirically focus on the viability of two training methods, Self-critical Sequence Training (SCST) and Gumbel Straight-Through (ST), and demonstrate that SCST shows more stable gradient behavior and improved results over Gumbel ST, even without accessing discriminator gradients directly. We also address the problem of automatic evaluation for captioning models, introducing a new semantic score and showing its correlation with human judgement. As an evaluation paradigm, we argue that an important criterion for a captioner is the ability to generalize to compositions of objects that do not usually co-occur. To this end, we introduce a small captioned Out of Context (OOC) test set. The OOC set, combined with our semantic score, is proposed as a new diagnostic tool for the captioning community. When evaluated on the OOC and MS-COCO benchmarks, SCST-based training shows strong performance in both semantic score and human evaluation, promising to be a valuable new approach for efficient discrete GAN training. Read More
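
To make the comparison between the two training methods concrete, here is a minimal sketch of the self-critical sequence training (SCST) update discussed above: a sampled caption's reward is baselined by the reward of the greedy caption, so the discriminator (or any other sequence-level scorer) is only ever queried for a value, never for gradients. The `captioner.decode` interface, the reward function, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def scst_loss(captioner, images, reward_fn):
    """One SCST step (sketch): REINFORCE with the greedy caption's
    reward as baseline; reward_fn is only queried for values."""
    # Greedy decode gives the baseline reward (no gradients needed).
    with torch.no_grad():
        greedy_caps = captioner.decode(images, sample=False)
        baseline = reward_fn(images, greedy_caps)            # (batch,)

    # Sample captions and keep per-token log-probabilities.
    sampled_caps, log_probs = captioner.decode(
        images, sample=True, return_log_probs=True)          # (batch, seq_len)
    with torch.no_grad():
        reward = reward_fn(images, sampled_caps)              # (batch,)

    # Push up the probability of samples that beat the greedy caption.
    advantage = (reward - baseline).unsqueeze(1)              # (batch, 1)
    return -(advantage * log_probs).sum(dim=1).mean()
```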

#gans, #image-recognition, #nlp

Connecting Touch and Vision via Cross-Modal Prediction

Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model. Read More
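
As a rough illustration of how the scale and location conditioning described above could be wired in, the sketch below feeds a tactile feature vector and an (x, y, scale) touch descriptor into a small conditional generator. The module sizes, the conditioning format, and the assumption that the tactile reading has already been encoded as a feature vector are all placeholders, not the authors' released model.

```python
import torch
import torch.nn as nn

class TouchToVisionGenerator(nn.Module):
    """Illustrative conditional generator: predicts a visual patch from a
    tactile signal plus the location and scale of the touch on the object."""

    def __init__(self, tactile_dim=256, cond_dim=3, img_channels=3):
        super().__init__()
        # Encode the tactile reading (assumed already flattened to a vector).
        self.tactile_enc = nn.Sequential(nn.Linear(tactile_dim, 512), nn.ReLU())
        # Encode where and how large the contact region is: (x, y, scale).
        self.cond_enc = nn.Sequential(nn.Linear(cond_dim, 64), nn.ReLU())
        # Decode the fused code into a 64x64 image patch.
        self.decoder = nn.Sequential(
            nn.Linear(512 + 64, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, tactile_feat, touch_loc_scale):
        z = torch.cat([self.tactile_enc(tactile_feat),
                       self.cond_enc(touch_loc_scale)], dim=1)
        return self.decoder(z)
```

In a full conditional GAN setup, a discriminator would see the generated patch together with the same touch descriptor, so that realism is judged relative to where and how large the contact was.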

#gans, #image-recognition

Photo Wake-Up: 3D Character Animation from a Single Photo

We present a method and application for animating a human subject from a single photo. For example, the character can walk out of the photo, run, sit, or jump in 3D. The key contributions of this paper are: 1) an application for viewing and animating humans in single photos in 3D, 2) a novel 2D warping method that deforms a posable template body model to fit the person’s complex silhouette and creates an animatable mesh, and 3) a method for handling partial self-occlusions. We compare to state-of-the-art related methods and evaluate results with human studies. Further, we present an interactive interface that allows re-posing the person in 3D, and an augmented reality setup where the animated 3D person can emerge from the photo into the real world. We demonstrate the method on photos, posters, and art. The project page is at https://grail.cs.washington.edu/projects/wakeup/. Read More
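
A very rough sketch of the silhouette-fitting idea behind contribution (2): pull each template boundary point to its nearest point on the person's silhouette, then spread the resulting displacements to arbitrary query points (for example, mesh vertices) by inverse-distance weighting. The paper's actual warping scheme is considerably more careful; everything below is an illustrative simplification.

```python
import numpy as np

def silhouette_warp(template_boundary, person_boundary, query_pts, eps=1e-6):
    """Sketch of a 2D warp from a template silhouette to a person's
    silhouette. Boundary arrays are (N, 2) and (M, 2); query_pts is (Q, 2).
    Illustration only, not the authors' method."""
    # Nearest-neighbor correspondence from template to person silhouette.
    d = np.linalg.norm(template_boundary[:, None, :] -
                       person_boundary[None, :, :], axis=-1)      # (N, M)
    targets = person_boundary[d.argmin(axis=1)]                   # (N, 2)
    displacements = targets - template_boundary                   # (N, 2)

    # Spread boundary displacements to the query points by
    # inverse-distance weighting to get a dense warp.
    w = 1.0 / (np.linalg.norm(query_pts[:, None, :] -
                              template_boundary[None, :, :], axis=-1) + eps)
    w /= w.sum(axis=1, keepdims=True)                              # (Q, N)
    return query_pts + w @ displacements                           # (Q, 2)
```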

#fake, #image-recognition

MIT CSAIL’s AI can visualize objects using touch

Robots that can learn to see by touch are within reach, claim researchers at MIT’s Computer Science and Artificial Intelligence Laboratory. Really. In a newly published paper that’ll be presented next week at the Conference on Computer Vision and Pattern Recognition in Long Beach, California, they describe an AI system capable of generating visual representations of objects from tactile signals, and of predicting tactility from snippets of visual data.

“By looking at the scene, our model can imagine the feeling of touching a flat surface or a sharp edge,” said CSAIL PhD student and lead author on the research Yunzhu Li, who wrote the paper alongside MIT professors Russ Tedrake and Antonio Torralba and MIT postdoc Jun-Yan Zhu. “By blindly touching around, our [AI] model can predict the interaction with the environment purely from tactile feelings. Bringing these two senses together could empower the robot and reduce the data we might need for tasks involving manipulating and grasping objects.” Read More

#gans, #image-recognition

Deep learning model from Lockheed Martin tackles satellite image analysis

A satellite imagery recognition system designed by Lockheed Martin engineers uses open-source deep learning libraries to quickly identify and classify objects or targets in large areas across the world. Company officials say the tool could save image analysts many hours of categorizing and labeling items within an image.

The model, Global Automated Target Recognition (GATR), runs in the cloud, using Maxar Technologies’ Geospatial Big Data platform (GBDX) to access Maxar’s 100 petabyte satellite imagery library and millions of curated data labels across dozens of categories that expedite the training of deep learning algorithms. Fast GPUs enable GATR to scan a large area very quickly, while deep learning methods automate object recognition and reduce the need for extensive algorithm training. Read More
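
GATR itself is proprietary, but the scan-a-large-area pattern described above can be sketched as tiling plus an off-the-shelf detector. The tile size, stride, score threshold, and the use of torchvision's pretrained Faster R-CNN are stand-ins chosen for illustration, not details of Lockheed Martin's pipeline.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def detect_in_tiles(image, tile=1024, stride=896, score_thresh=0.5):
    """Scan a large image tensor (C, H, W, values in [0, 1]) in overlapping
    tiles and return detections in full-image coordinates.
    Illustrative sketch only; not Lockheed Martin's GATR."""
    model = fasterrcnn_resnet50_fpn(pretrained=True).eval()
    _, h, w = image.shape
    detections = []
    with torch.no_grad():
        for y in range(0, max(h - tile, 0) + 1, stride):
            for x in range(0, max(w - tile, 0) + 1, stride):
                patch = image[:, y:y + tile, x:x + tile]
                out = model([patch])[0]
                keep = out["scores"] > score_thresh
                boxes = out["boxes"][keep].clone()
                boxes[:, [0, 2]] += x   # shift boxes back to global coords
                boxes[:, [1, 3]] += y
                detections.append({"boxes": boxes,
                                   "labels": out["labels"][keep],
                                   "scores": out["scores"][keep]})
    return detections
```

Overlapping tiles (stride smaller than the tile size) mean an object clipped at one tile's edge is still seen whole by a neighboring tile; in practice the overlapping detections would then be merged with non-maximum suppression.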

#image-recognition

Detecting Kissing Scenes in a Database of Hollywood Films

Detecting scene types in a movie can be very useful for applications such as video editing, ratings assignment, and personalization. We propose a system for detecting kissing scenes in a movie. This system consists of two components. The first component is a binary classifier that predicts a binary label (i.e. kissing or not) given features extracted from both the still frames and audio waveform of a one-second segment. The second component aggregates the binary labels for contiguous non-overlapping segments into a set of kissing scenes. We experimented with a variety of 2D and 3D convolutional architectures such as ResNet, DenseNet, and VGGish and developed a highly accurate kissing detector that achieves a validation F1 score of 0.95 on a diverse database of Hollywood films spanning many genres and multiple decades. The code for this project is available at http://github.com/amirziai/kissing-detector. Read More
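
The second component, turning per-second binary labels into scenes, is straightforward to sketch: bridge short negative gaps and drop runs that are too brief. The `min_len` and `max_gap` values below are assumptions for illustration, not parameters reported in the paper.

```python
def segments_to_scenes(labels, min_len=5, max_gap=2):
    """Aggregate per-second binary predictions (1 = kissing) into scenes.

    A scene is a run of positive seconds; short non-kissing gaps are
    bridged, and runs shorter than min_len seconds are dropped.
    Returns a list of (start_second, end_second) pairs, end exclusive.
    """
    scenes, start, gap = [], None, 0
    for t, y in enumerate(labels):
        if y:
            if start is None:
                start = t
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:                 # gap too long: close the scene
                end = t - gap + 1
                if end - start >= min_len:
                    scenes.append((start, end))
                start, gap = None, 0
    if start is not None:                     # close a scene that runs to the end
        end = len(labels) - gap
        if end - start >= min_len:
            scenes.append((start, end))
    return scenes
```

For example, `segments_to_scenes([0, 1, 1, 1, 1, 1, 0, 0])` returns `[(1, 6)]`.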

#image-recognition, #news-summarization

Does Object Recognition Work for Everyone?

The paper analyzes the accuracy of publicly available object-recognition systems on a geographically diverse dataset. This dataset contains household items and was designed to have a more representative geographical coverage than commonly used image datasets in object recognition. We find that the systems perform relatively poorly on household items that commonly occur in countries with a low household income. Qualitative analyses suggest the drop in performance is primarily due to appearance differences within an object class (e.g., dish soap) and due to items appearing in a different context (e.g., toothbrushes appearing outside of bathrooms). The results of our study suggest that further work is needed to make object-recognition systems work equally well for people across different countries and income levels. Read More

#image-recognition