On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models. Read More

#nlp

Why Some Models Leak Data

Machine learning models use large amounts of data, some of which can be sensitive. If they’re not trained correctly, sometimes that data is inadvertently revealed.

… Models of real world data are often quite complex—this can improve accuracy, but makes them more susceptible to unexpectedly leaking information. Medical models have inadvertently revealed patients’ genetic markers. Language models have memorized credit card numbers. Faces can even be reconstructed from image models.

… Training models with differential privacy stops the training data from leaking by limiting how much the model can learn from any one data point. Differentially private models are still at the cutting edge of research, but they’re being packaged into machine learning frameworks, making them much easier to use. When it isn’t possible to train differentially private models, there are also tools that can measure how much data the model is memorizing. Read More
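The core mechanism behind differentially private training (DP-SGD) can be sketched in a few lines: clip each example's gradient so no single data point can dominate the update, then add calibrated noise to the averaged result. The sketch below is illustrative only, using NumPy and made-up parameter values rather than any framework's actual API:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One illustrative DP-SGD update: clip, average, add Gaussian noise.

    Clipping bounds any single example's influence on the update;
    the noise masks whatever influence remains.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise scale follows the usual sigma * C / batch_size pattern.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return mean_grad + noise
```

The privacy guarantee comes from the combination: because clipping caps each example's contribution at `clip_norm`, the added noise can be sized to hide whether any one example was in the batch at all.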

#adversarial, #privacy, #model-attacks

Chinese Technology Platforms Operating in the United States: Assessing the Threat

… Going forward, the US government has an urgent need for smart policies and practices to respond to China’s growing tech sector and the spread of China-controlled platforms. The Biden administration will have to decide what to do about TikTok and WeChat. It also will need to develop a broader US strategy for addressing the range of security risks (e.g., economic, national security, cybersecurity) and threats to civil liberties posed by the spread of China-developed and -controlled technologies.

This report seeks to contribute to these efforts by suggesting a comprehensive framework for understanding and assessing the risks posed by Chinese technology platforms in the United States. It is the product of a working group convened by the Tech, Law & Security Program at American University Washington College of Law and the National Security, Technology, and Law Working Group at the Hoover Institution at Stanford University. Read More

#china-vs-us

How the Kremlin Uses Agenda Setting to Paint Democracy in Panic

Since November 2020, the world has watched the presidential transition in the United States with unease. After a violent mob of Trump supporters stormed the U.S. Capitol on Jan. 6 in an effort to overturn Joe Biden’s election, headlines around the world questioned, for the first time, whether a democratic transfer of power would occur as expected. These reports also included the well-documented risks of violence that might occur at President Biden’s inauguration. 

…But Russian media tell a different story. By flooding the front pages of its media with headlines of continued unrest, opposition criticism and government suppression, the Kremlin has pulled out an old playbook in its efforts to sway global opinion against the promise of Western liberalism. And compared with the shadowy bots and trolls we’ve grown to associate with Russian influence operations, these tactics may prove even tougher to counter. Read More

#russia

Using Machine Learning to Fill Gaps in Chinese AI Market Data

In this proof-of-concept project, CSET and Amplyfi Ltd. used machine learning models and Chinese-language web data to identify Chinese companies active in artificial intelligence. Most of these companies were not labeled or described as AI-related in two high-quality commercial datasets. The authors’ findings show that using structured data alone—even from the best providers—will yield an incomplete picture of the Chinese AI landscape. Read More

#china-ai