We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution.
We wrote a scenario that represents our best guess about what that might look like. It’s informed by trend extrapolations, wargames, expert feedback, experience at OpenAI, and previous forecasting successes. — Read More
Recent Updates Page 72
GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs
Recently, we introduced Platinum Benchmarks as a step toward quantifying the reliability of large language models (LLMs). In that work, we revised older benchmarks to minimize label noise, such as ambiguous or mislabeled examples, and showed that frontier LLMs still make genuine errors on simple questions. For example, as part of that work we revised a 300-problem subset of GSM8K, a dataset of grade school math word problems, and found that all LLMs we tested made at least one genuine error. If certifying the precision of just a subset of the dataset can highlight new failures across models, what if we scale to all of GSM8K?
Today, we’re releasing GSM8K-Platinum, a revised version of the full GSM8K test set. Our comparative evaluation of several frontier LLMs on both the original and revised datasets demonstrates that GSM8K-Platinum provides a more accurate assessment of mathematical reasoning capabilities, revealing differences in performance that were previously hidden. — Read More
Do Large Language Model Benchmarks Test Reliability?
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs’ growing capabilities, however there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL — Read More
Trapping misbehaving bots in an AI Labyrinth
Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.
…While Cloudflare has several tools for identifying and blocking unauthorized AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race. So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.
To do this, we decided to use a new offensive tool in the bot creator’s toolset that we haven’t really seen used defensively: AI-generated content. — Read More
Vision-Speech Models: Teaching Speech Models to Converse about Images
The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., “speechless”) and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation. — Read More
Trusted Machine Learning Models Unlock Private Inference for Problems CurrentlyInfeasible with Cryptography
We often interact with untrusted parties. Prioritization of privacy can limit the effectiveness of these interactions, as achieving certain goals necessitates sharing private data. Traditionally, addressing this challenge has involved either seeking trusted intermediaries or constructing cryptographic protocols that restrict how much data is revealed, such as multi-party computations or zero-knowledge proofs. While significant advances have been made in scaling cryptographic approaches, they remain limited in terms of the size and complexity of applications they can be used for. In this paper, we argue that capable machine learning models can fulfill the role of a trusted third party, thus enabling secure computations for applications that were previously infeasible. In particular, we describe Trusted Capable Model Environments (TCMEs) as an alternative approach for scaling secure computation, where capable machine learning model(s) interact under input/output constraints, with explicit information flow control and explicit statelessness. This approach aims to achieve a balance between privacy and computational efficiency, enabling private inference where classical cryptographic solutions are currently infeasible. We describe a number of use cases that are enabled by TCME, and show that even some simple classic cryptographic problems can already be solved with TCME. Finally, we outline current limitations and discuss the path forward in implementing them. — Read More
#trustRunway releases an impressive new video-generating AI model
AI startup Runway on Monday released what it claims is one of the highest-fidelity AI-powered video generators yet.
Called Gen-4, the model is rolling out to the company’s individual and enterprise customers. Runway claims that it can generate consistent characters, locations, and objects across scenes, maintain “coherent world environments,” and regenerate elements from different perspectives and positions within scenes. — Read More
OpenAI rolls out image generation powered by GPT-4o to ChatGPT
OpenAI is integrating new image generation capabilities directly into ChatGPT — this feature is dubbed “Images in ChatGPT.” Users can now use GPT-4o to generate images within ChatGPT itself.
This initial release focuses solely on image creation and will be available across ChatGPT Plus, Pro, Team, and Free subscription tiers. The free tier’s usage limit is the same as DALL-E, spokesperson Taya Christianson told The Verge, but added that they “didn’t have a specific number to share” and ”these may change over time based on demand.“ Per the ChatGPT FAQ, free users were previously able to generate “three images per day with DALL·E 3.” As for the fate of DALL-E, Christianson said “fans” will “still have access via a custom GPT.” — Read More
Deciphering language processing in the human brain through LLM representations
Large Language Models (LLMs) optimized for predicting subsequent utterances and adapting to tasks using contextual embeddings can process natural language at a level close to human proficiency. This study shows that neural activity in the human brain aligns linearly with the internal contextual embeddings of speech and language within large language models (LLMs) as they process everyday conversations.
How does the human brain process natural language during everyday conversations? Theoretically, large language models (LLMs) and symbolic psycholinguistic models of human language provide a fundamentally different computational framework for coding natural language. Large language models do not depend on symbolic parts of speech or syntactic rules. Instead, they utilize simple self-supervised objectives, such as next-word prediction and generation enhanced by reinforcement learning. This allows them to produce context-specific linguistic outputs drawn from real-world text corpora, effectively encoding the statistical structure of natural speech (sounds) and language (words) into a multidimensional embedding space.
Inspired by the success of LLMs, our team at Google Research, in collaboration with Princeton University, NYU, and HUJI, sought to explore the similarities and differences in how the human brain and deep language models process natural language to achieve their remarkable capabilities. Through a series of studies over the past five years, we explored the similarity between the internal representations (embeddings) of specific deep learning models and human brain neural activity during natural free-flowing conversations, demonstrating the power of deep language model’s embeddings to act as a framework for understanding how the human brain processes language. We demonstrate that the word-level internal embeddings generated by deep language models align with the neural activity patterns in established brain regions associated with speech comprehension and production in the human brain. — Read More
Cloudflare turns AI against itself with endless maze of irrelevant facts
On Wednesday, web infrastructure provider Cloudflare announced a new feature called “AI Labyrinth” that aims to combat unauthorized AI data scraping by serving fake AI-generated content to bots. The tool will attempt to thwart AI companies that crawl websites without permission to collect training data for large language models that power AI assistants like ChatGPT.
… Instead of simply blocking bots, Cloudflare’s new system lures them into a “maze” of realistic-looking but irrelevant pages, wasting the crawler’s computing resources. The approach is a notable shift from the standard block-and-defend strategy used by most website protection services. Cloudflare says blocking bots sometimes backfires because it alerts the crawler’s operators that they’ve been detected. — Read More