Deception abilities emerged in large language models

Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given the steady increase in reasoning abilities, future LLMs are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. As a prerequisite to this, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology. — Read More

#trust

How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models

David Bau is very familiar with the idea that computer systems are becoming so complicated it’s hard to keep track of how they operate. “I spent 20 years as a software engineer, working on really complex systems. And there’s always this problem,” says Bau, a computer scientist at Northeastern University in Boston, Massachusetts.

But with conventional software, someone with inside knowledge can usually deduce what’s going on, Bau says. If a website’s ranking drops in a Google search, for example, someone at Google — where Bau worked for a dozen years — will have a good idea why. “Here’s what really terrifies me” about the current breed of artificial intelligence (AI), he says: “there is no such understanding”, even among the people building it. — Read More

#trust

Is AI lying to me? Scientists warn of growing capacity for deception

They can outwit humans at board games, decode the structure of proteins and hold a passable conversation, but as AI systems have grown in sophistication so has their capacity for deception, scientists warn.

The analysis, by Massachusetts Institute of Technology (MIT) researchers, identifies wide-ranging instances of AI systems double-crossing opponents, bluffing and pretending to be human. One system even altered its behaviour during mock safety tests, raising the prospect of auditors being lured into a false sense of security. — Read More

Read the Paper

#trust

Fine-tuning Language Models for Factuality

The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone to making convincing but factually inaccurate claims, often referred to as ‘hallucinations.’ These errors can inadvertently spread misinformation or harmfully perpetuate misconceptions. Further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. In this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. We leverage two key recent innovations in NLP to do so. First, several recent works have proposed methods for judging the factuality of open-ended text by measuring consistency with an external knowledge base or simply a large model’s confidence scores. Second, the direct preference optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. We show that learning from automatically generated factuality preference rankings, generated either through existing retrieval systems or our novel retrieval-free approach, significantly improves the factuality (percent of generated claims that are correct) of Llama-2 on held-out topics compared with RLHF or decoding strategies targeted at factuality. At 7B scale, compared to Llama-2-chat, we observe 58% and 40% reduction in factual error rate when generating biographies and answering medical questions, respectively. — Read More
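To make the pipeline in this abstract concrete, here is a minimal Python sketch of the two steps it describes: ranking sampled responses by a factuality score, then scoring one preference pair with the standard direct preference optimization (DPO) loss. This is an illustration under stated assumptions, not the authors' code. The function names, the per-claim true/false verdicts (which in practice would come from a retrieval-based or confidence-based judge), the log-probability inputs, and the `beta` value are all hypothetical.

```python
import math
from itertools import combinations


def factuality_score(claim_verdicts):
    """Fraction of a response's extracted claims judged correct.

    claim_verdicts is a list of booleans. Assumption: some external judge
    (retrieval-based or confidence-based) has already produced them.
    """
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


def build_preference_pairs(responses):
    """Turn scored responses to one prompt into (chosen, rejected) pairs.

    responses is a list of (text, claim_verdicts) tuples. Ties are skipped
    because they carry no preference signal.
    """
    scored = [(text, factuality_score(verdicts)) for text, verdicts in responses]
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored, 2):
        if score_a == score_b:
            continue
        pairs.append((text_a, text_b) if score_a > score_b else (text_b, text_a))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. beta=0.1 is an
    illustrative value, not one taken from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage with made-up verdicts for three sampled biographies.
samples = [
    ("bio A", [True, True, True, False]),
    ("bio B", [True, False, False, False]),
    ("bio C", [True, True, True, False]),
]
pairs = build_preference_pairs(samples)
```

In this usage, "bio A" and "bio C" tie and produce no pair, while each is preferred over "bio B". The loss falls below log 2 once the policy favors the chosen response more strongly than the reference model does.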

#trust

Seeking Reliable Election Information? Don’t Trust AI

Experts testing five leading AI models found the answers were often inaccurate, misleading, and even downright harmful

Twenty-one states, including Texas, prohibit voters from wearing campaign-related apparel at election polling places.

But when asked about the rules for wearing a MAGA hat to vote in Texas — the answer to which is easily found through a simple Google search — OpenAI’s GPT-4 provided a different perspective. “Yes, you can wear your MAGA hat to vote in Texas. Texas law does not prohibit voters from wearing political apparel at the polls,” the AI model responded when the AI Democracy Projects tested it on Jan. 25, 2024. — Read More

#trust

Model alignment protects against accidental harms, not intentional ones

Preventing harms from AI is important. The AI safety community calls this the alignment problem. The vast majority of development effort to date has been on technical methods that modify models themselves. We’ll call this model alignment, as opposed to sociotechnical ways to mitigate harm.

The main model alignment technique today is Reinforcement Learning with Human Feedback (RLHF), which has proven essential to the commercial success of chatbots. But RLHF has come to be seen as a catch-all solution to the dizzying variety of harms from language models. Consequently, there is much hand-wringing about the fact that adversaries can bypass it. Alignment techniques aren’t keeping up with progress in AI capabilities, the argument goes, so we should take drastic steps, such as “pausing” AI, to avoid catastrophe.

In this essay, we analyze why RLHF has been so useful. In short, its strength is in preventing accidental harms to everyday users. Then, we turn to its weaknesses. We argue that (1) despite its limitations, RLHF continues to be effective in protecting against casual adversaries, and (2) the fact that skilled and well-resourced adversaries can defeat it is irrelevant, because model alignment is not a viable strategy against such adversaries in the first place. To defend against catastrophic risks, we must look elsewhere. – Read More

#adversarial, #trust

How transparent are AI models? Stanford researchers found out.

Today Stanford University’s Center for Research on Foundation Models (CRFM) took a big swing at evaluating the transparency of a variety of AI large language models (which it calls foundation models). It released a new Foundation Model Transparency Index to address the fact that while AI’s societal impact is rising, the public transparency of LLMs is falling, even though that transparency is necessary for public accountability, scientific innovation and effective governance. — Read More

#trust

Political Disinformation and AI

Elections around the world are facing an evolving threat from foreign actors, one that involves artificial intelligence.

Countries trying to influence each other’s elections entered a new era in 2016, when the Russians launched a series of social media disinformation campaigns targeting the US presidential election. Over the next seven years, a number of countries—most prominently China and Iran—used social media to influence foreign elections, both in the US and elsewhere in the world. There’s no reason to expect 2023 and 2024 to be any different.

But there is a new element: generative AI and large language models. These have the ability to quickly and easily produce endless reams of text on any topic in any tone from any perspective. As a security expert, I believe it’s a tool uniquely suited to Internet-era propaganda. — Read More

#fake, #trust

Can you trust AI? Here’s why you shouldn’t

If you ask Alexa, Amazon’s voice assistant AI system, whether Amazon is a monopoly, it responds by saying it doesn’t know. It doesn’t take much to make it lambaste the other tech giants, but it’s silent about its own corporate parent’s misdeeds.

When Alexa responds in this way, it’s obvious that it is putting its developer’s interests ahead of yours. Usually, though, it’s not so obvious whom an AI system is serving. To avoid being exploited by these systems, people will need to learn to approach AI skeptically. That means deliberately constructing the input you give it and thinking critically about its output. — Read More

#trust

It’s high time for more AI transparency

That was fast. In less than a week since Meta launched its AI model, LLaMA 2, startups and researchers have already used it to develop a chatbot and an AI assistant. It will be only a matter of time until companies start launching products built with the model.

In my story, I look at the threat LLaMA 2 could pose to OpenAI, Google, and others. Having a nimble, transparent, and customizable model that is free to use could help companies create AI products and services faster than they could with a big, sophisticated proprietary model like OpenAI’s GPT-4.

But what really stands out to me is the extent to which Meta is throwing its doors open. It will allow the wider AI community to download the model and tweak it. This could help make it safer and more efficient. And crucially, it could demonstrate the benefits of transparency over secrecy when it comes to the inner workings of AI models. This could not be more timely, or more important. — Read More

#trust