BERT-ATTACK: Adversarial Attack Against BERT Using BERT

Adversarial attacks on discrete data (such as text) have proved significantly more challenging than on continuous data (such as images), since it is difficult to generate adversarial samples with gradient-based methods. Current successful attack methods for text usually adopt heuristic replacement strategies at the character or word level, but it remains challenging to find optimal solutions in the massive space of possible replacement combinations while preserving semantic consistency and language fluency. In this paper, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using pre-trained masked language models exemplified by BERT. We turn BERT against its fine-tuned models and other deep neural models on downstream tasks, successfully misleading the target models into incorrect predictions. Our method outperforms state-of-the-art attack strategies in both success rate and perturbation percentage, while the generated adversarial samples are fluent and semantically preserved. The computational cost is also low, making large-scale generation feasible. The code is available at https://github.com/LinyangLee/BERT-Attack Read More
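
The heart of the method is using BERT's own masked-language-modeling head to propose fluent, in-context substitutes for the most vulnerable words, then checking which substitute flips the target classifier. Below is a minimal sketch of that candidate-generation step with HuggingFace Transformers; the whitespace tokenization and the omission of the word-importance ranking are simplifications of the paper's pipeline, which also handles sub-word targets.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def candidate_replacements(sentence: str, target_word: str, k: int = 8):
    """Mask one word and let BERT propose in-context substitutes for it."""
    words = sentence.split()
    idx = words.index(target_word)
    masked = " ".join(words[:idx] + [tokenizer.mask_token] + words[idx + 1:])
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos[0]].topk(k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

# Each candidate would then be substituted back into the sentence and scored
# against the target classifier, keeping the replacement that flips its prediction.
print(candidate_replacements("the movie was surprisingly good", "good"))
```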

#adversarial

Planting Undetectable Backdoors in Machine Learning Models

Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. Delegation of learning has clear benefits, and at the same time raises serious concerns of trust. This work studies possible abuses of power by untrusted learners. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate “backdoor key,” the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees.

Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, by constructing an undetectable backdoor for an “adversarially-robust” learning algorithm, we can produce a classifier that is indistinguishable from a robust classifier, but in which every input has an adversarial example! In this way, the existence of undetectable backdoors represents a significant theoretical roadblock to certifying adversarial robustness. Read More
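
The paper's actual constructions rest on digital-signature schemes and come with provable undetectability guarantees; the toy sketch below only illustrates the backdoor-key mechanic it describes, namely a model that behaves normally on natural inputs but flips its prediction whenever a small, key-derived perturbation is present. Everything here (the secret direction, the threshold, the binary labels) is an illustrative assumption, not the paper's construction.

```python
import hashlib
import numpy as np

DIM = 512

def secret_direction(key: bytes, dim: int = DIM) -> np.ndarray:
    """Derive a fixed unit vector from the backdoor key."""
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def backdoored_predict(clean_predict, x, key: bytes, thresh: float = 3.0) -> int:
    """Behave exactly like clean_predict unless x sits unusually far along the
    secret direction, in which case flip the (binary) label."""
    v = secret_direction(key)
    if float(x @ v) > thresh:
        return 1 - clean_predict(x)
    return clean_predict(x)

def activate_backdoor(x, key: bytes, thresh: float = 3.0) -> np.ndarray:
    """Attacker side: nudge x just past the threshold along the secret direction."""
    v = secret_direction(key)
    return x + (thresh + 0.1 - float(x @ v)) * v

clean = lambda x: int(x.sum() > 0)              # stand-in for any binary classifier
key = b"held-only-by-the-malicious-trainer"
x = np.random.default_rng(0).standard_normal(DIM)

print(backdoored_predict(clean, x, key) == clean(x))          # True: normal behaviour (w.h.p.)
x_adv = activate_backdoor(x, key)                             # small push along the hidden direction
print(backdoored_predict(clean, x_adv, key) != clean(x_adv))  # True: prediction flipped
```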

#adversarial, #trust

AI Risk Management Framework: Initial Draft

This initial draft of the Artificial Intelligence Risk Management Framework (AI RMF, or Framework) builds on the concept paper released in December 2021 and incorporates the feedback received. The AI RMF is intended for voluntary use in addressing risks in the design, development, use, and evaluation of AI products, services, and systems.

AI research and deployment is evolving rapidly. For that reason, the AI RMF and its companion documents will evolve over time. When AI RMF 1.0 is issued in January 2023, NIST, working with stakeholders, intends to have built out the remaining sections to reflect new knowledge, awareness, and practices.

Part I of the AI RMF sets the stage for why the AI RMF is important and explains its intended use and audience. Part II includes the AI RMF Core and Profiles. Part III includes a companion Practice Guide to assist in adopting the AI RMF.

The Practice Guide, which will be released for comment, includes additional examples and practices that can assist in using the AI RMF. The Guide will be part of a NIST AI Resource Center that is being established. Read More

#adversarial, #nist

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional attack methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand their vulnerability to attacks in the real world, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We term these feature-fool attacks. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically-realizable. These attacks reveal spurious, semantically-describable feature/class associations that can be exploited by novel combinations of objects. We use them to guide the design of “copy/paste” adversaries in which one natural image is pasted into another to cause a targeted misclassification. Read More
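
At a high level, the attack optimizes a single pattern so that it pushes any source image toward a chosen target class. The sketch below illustrates that universal, targeted objective in PyTorch, but it optimizes raw patch pixels directly rather than the inputs of a deep image generator as the paper does, so it omits the interpretability machinery; the model choice, fixed patch placement, and hyperparameters are assumptions, and input normalization is skipped for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 954                      # assumed ImageNet target label ("banana")
patch = torch.rand(1, 3, 64, 64, device=device, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.01)

def paste(images, patch):
    """Paste the patch at a fixed corner of every (B, 3, 224, 224) image in [0, 1];
    the paper randomizes placement and other transformations."""
    out = images.clone()
    out[:, :, :64, :64] = patch.clamp(0, 1)
    return out

def attack_step(images):
    """One optimization step: make the patched images classified as the target class."""
    logits = model(paste(images.to(device), patch))
    targets = torch.full((images.size(0),), target_class, device=device)
    loss = F.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# In practice attack_step would be run over batches drawn from many source images
# (e.g. an ImageNet loader), which is what makes the resulting patch universal.
```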

#adversarial

Dodging Attack Using Carefully Crafted Natural Makeup

Deep learning face recognition models are used by state-of-the-art surveillance systems to identify individuals passing through public areas (e.g., airports). Previous studies have demonstrated the use of adversarial machine learning (AML) attacks to successfully evade identification by such systems, both in the digital and physical domains. Attacks in the physical domain, however, require significant manipulation of the human participant’s face, which can raise suspicion among human observers (e.g., airport security officers). In this study, we present a novel black-box AML attack that carefully crafts natural makeup which, when applied to a human participant, prevents the participant from being identified by facial recognition models. We evaluated our proposed attack against the ArcFace face recognition model, with 20 participants, in a real-world setup that includes two cameras, different shooting angles, and different lighting conditions. The evaluation results show that in the digital domain the face recognition system was unable to identify any of the participants, while in the physical domain it identified the participants in only 1.22% of the frames (compared to 47.57% without makeup and 33.73% with random natural makeup), which is below a reasonable threshold for a realistic operational environment. Read More

#adversarial, #surveillance

EvilModel: Hiding Malware Inside of Neural Network Models

Delivering malware covertly while evading detection is critical to advanced malware campaigns. In this paper, we present a method that delivers malware covertly, and evades detection, through neural network models. Neural network models are poorly explainable and have good generalization ability. By embedding malware into the neurons, malware can be delivered covertly, with minor or even no impact on the performance of the neural network. Meanwhile, because the structure of the model remains unchanged, it can pass the security scans of antivirus engines. Experiments show that 36.9 MB of malware can be embedded into a 178 MB AlexNet model within 1% accuracy loss, and no suspicions are raised by the antivirus engines on VirusTotal, which verifies the feasibility of this method. With the widespread application of artificial intelligence, hiding payloads in neural networks may become a growing trend in malware. We hope this work provides a reference scenario for defending against neural-network-assisted attacks. Read More
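
Mechanically, the embedding is steganography in the parameters: the low-order mantissa bytes of a float32 weight barely affect its value, so arbitrary bytes can be stashed there and recovered later while the network keeps working. A rough sketch of that principle in NumPy follows; the paper overwrites three bytes per parameter of selected AlexNet neurons, whereas this sketch uses only the two least-significant mantissa bytes and a harmless text payload to keep the distortion obviously small.

```python
import numpy as np

BYTES_PER_WEIGHT = 2   # overwrite only the two low-order mantissa bytes of each float32

def embed(weights: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide payload bytes in the low-order mantissa bytes of little-endian float32 weights."""
    assert len(payload) <= weights.size * BYTES_PER_WEIGHT, "payload too large for this tensor"
    buf = bytearray(weights.astype("<f4").tobytes())
    for i, b in enumerate(payload):
        w, k = divmod(i, BYTES_PER_WEIGHT)      # weight index, byte slot within that weight
        buf[w * 4 + k] = b                      # bytes 0 and 1 are the least-significant mantissa bytes
    return np.frombuffer(bytes(buf), dtype="<f4").reshape(weights.shape)

def extract(weights: np.ndarray, n: int) -> bytes:
    """Recover the first n hidden bytes from a stego weight tensor."""
    buf = weights.astype("<f4").tobytes()
    return bytes(buf[(i // BYTES_PER_WEIGHT) * 4 + i % BYTES_PER_WEIGHT] for i in range(n))

w = np.random.default_rng(0).standard_normal(1000).astype("<f4")
msg = b"harmless demo payload"
w_stego = embed(w, msg)
print(extract(w_stego, len(msg)))               # b'harmless demo payload'
print(np.max(np.abs(w_stego - w) / np.abs(w)))  # relative change per weight stays below ~0.8%
```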

#adversarial, #cyber

Researchers Hid Malware Inside an AI’s ‘Neurons’ And It Worked Scarily Well

In a proof-of-concept, researchers reported they could embed malware in up to half of an AI model’s nodes and still obtain very high accuracy.

Neural networks could be the next frontier for malware campaigns as they become more widely used, according to a new study. 

According to the study, which was posted to the arXiv preprint server on Monday, malware can be embedded directly into the artificial neurons that make up machine learning models in a way that keeps them from being detected. The neural network would even be able to continue performing its set tasks normally. Read More

#adversarial, #cyber

Poison in the Well

Securing the Shared Resources of Machine Learning

Progress in machine learning depends on trust. Researchers often place their advances in a public well of shared resources, and developers draw on those to save enormous amounts of time and money. Coders use the code of others, harnessing common tools rather than reinventing the wheel. Engineers use systems developed by others as a basis for their own creations. Data scientists draw on large public datasets to train machines to carry out routine tasks, such as image recognition, autonomous driving, and text analysis. Machine learning has accelerated so quickly and proliferated so widely largely because of this shared well of tools and data.

But the trust that so many place in these common resources is a security weakness. Poison in this well can spread, affecting the products that draw from it. Right now, it is hard to verify that the well of machine learning is free from malicious interference. In fact, there are good reasons to be worried. Attackers can poison the well’s three main resources—machine learning tools, pretrained machine learning models, and datasets for training—in ways that are extremely difficult to detect. Read More

#adversarial, #cyber

Attackers can elicit ‘toxic behavior’ from AI translation systems, study finds

Neural machine translation (NMT), or AI that can translate between languages, is in widespread use today, owing to its robustness and versatility. But NMT systems can be manipulated if provided prompts containing certain words, phrases, or alphanumeric symbols. For example, in 2015 Google had to fix a bug that caused Google Translate to offer homophobic slurs like “poof” and “queen” to those translating the word “gay” from English into Spanish, French, or Portuguese. In another glitch, Reddit users discovered that typing repeated words like “dog” into Translate and asking the system for a translation to English yielded “doomsday predictions.”

A new study from researchers at the University of Melbourne, Facebook, Twitter, and Amazon suggests NMT systems are even more vulnerable than previously believed. By targeting a process called back-translation, an attacker could elicit “toxic behavior” from a system by inserting only a few words or sentences into the dataset used to train the underlying model, the coauthors found. Read More

#adversarial, #nlp

Markpainting: Adversarial Machine Learning meets Inpainting

Inpainting is a learned interpolation technique that is based on generative modeling and used to populate masked or missing pieces in an image; it has wide applications in picture editing and retouching. Recently, inpainting started being used for watermark removal, raising concerns. In this paper we study how to manipulate it using our markpainting technique. First, we show how an image owner with access to an inpainting model can augment their image in such a way that any attempt to edit it using that model will add arbitrary visible information. We find that we can target multiple different models simultaneously with our technique. This can be designed to reconstitute a watermark if the editor had been trying to remove it. Second, we show that our markpainting technique is transferable to models that have different architectures or were trained on different datasets, so watermarks created using it are difficult for adversaries to remove. Markpainting is novel and can be used as a manipulation alarm that becomes visible in the event of inpainting. Read More
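
Conceptually, markpainting is an adversarial perturbation computed by the image owner rather than by an attacker: the owner nudges the image so that an inpainting model, asked to fill in any masked region, paints the owner's chosen pattern back in. The sketch below shows that objective under assumptions: `inpaint_model` stands in for a hypothetical differentiable inpainter, and the perturbation budget, loss, and iteration count are illustrative rather than the paper's settings.

```python
import torch

def markpaint(image, mask, target, inpaint_model, epsilon=8 / 255, steps=200, lr=1e-2):
    """Return a protected copy of `image` whose inpainted output reveals `target`.

    image, target: (1, 3, H, W) tensors in [0, 1]; mask: (1, 1, H, W) with 1 = region to fill.
    inpaint_model(image, mask) -> filled image (assumed differentiable).
    """
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        protected = (image + delta).clamp(0, 1)
        filled = inpaint_model(protected, mask)
        # push the model's fill toward the target pattern inside the masked region
        loss = ((filled - target) * mask).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # keep the perturbation visually negligible on the original image
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (image + delta.detach()).clamp(0, 1)

# The paper additionally optimizes over many random masks and several inpainting
# models at once, which is what makes the mark transferable and hard to remove.
```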

#adversarial, #image-recognition