How to build and run LLM evals — and why you should use precision and recall when benchmarking your LLM prompt template
Large language models (LLMs) are an incredible tool for developers and business leaders to create new value for consumers. They make personal recommendations, translate between unstructured and structured data, summarize large amounts of information, and do so much more.
As the applications multiply, so does the importance of measuring the performance of LLM-based applications. This is a nontrivial problem for several reasons: user feedback or any other “source of truth” is extremely limited and often nonexistent; even when possible, human labeling is still expensive; and it is easy to make these applications complex.
This complexity is often hidden by the abstraction layers of code and only becomes apparent when things go wrong. One line of code can initiate a cascade of calls (spans). Different evaluations are required for each span, thus multiplying your problems. For example, the simple code snippet below triggers multiple sub-LLM calls. — Read More
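To make the headline's point concrete, here is a minimal sketch (not the article's own snippet, which sits behind the link) of how precision and recall are computed when benchmarking a prompt template against a small set of human labels. The label names and data below are hypothetical toy values. Precision and recall separate the two failure modes that a single accuracy number blends together: flagging things that should not have been flagged, and missing things that should have been caught.

```python
# A minimal sketch (not the article's snippet): scoring an LLM prompt template
# that labels items as "relevant" / "irrelevant" against human labels.
# The labels below are hypothetical toy data.
human_labels = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
llm_outputs  = ["relevant", "relevant",   "relevant", "irrelevant", "irrelevant"]

tp = sum(h == m == "relevant" for h, m in zip(human_labels, llm_outputs))
fp = sum(h == "irrelevant" and m == "relevant" for h, m in zip(human_labels, llm_outputs))
fn = sum(h == "relevant" and m == "irrelevant" for h, m in zip(human_labels, llm_outputs))

precision = tp / (tp + fp)  # of everything the template flagged, how much was right
recall = tp / (tp + fn)     # of everything it should have flagged, how much it caught
print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.67 / 0.67 for this toy data
```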
Evaluating LLMs is a minefield
How Google taught AI to doubt itself
… From the day that the chatbots arrived last year, their makers warned us not to trust them. The text generated by tools like ChatGPT does not draw on a database of established facts. Instead, chatbots are predictive — making probabilistic guesses about which words seem right based on the massive corpus of text that their underlying large language models were trained on.
As a result, chatbots are often “confidently wrong,” to use the industry’s term. And this can fool even highly educated people, as we saw this year with the case of the lawyer who submitted citations generated by ChatGPT — not realizing that every single case had been fabricated out of whole cloth. — Read More
OpenAI scuttles AI-written text detector over ‘low rate of accuracy’
OpenAI has shut down its AI classifier, a tool that claimed to determine the likelihood a text passage was written by another AI. While many used and perhaps unwisely relied on it to catch low-effort cheats, OpenAI has retired it over its widely criticized “low rate of accuracy.”
The theory that AI-generated text has some identifying feature or pattern that can be detected reliably seems intuitive, but so far this has not really been borne out in practice. Although some generated text may have an obvious tell, the differences between large language models and the rapidity with which they have developed have made those tells all but impossible to rely on. — Read More
GPT detectors are biased against non-native English writers
GPT detectors frequently misclassify non-native English writing as AI generated, raising concerns about fairness and robustness. Addressing the biases in these detectors is crucial to prevent the marginalization of non-native English speakers in evaluative and educational settings and to create a more equitable digital landscape.
… GPT detectors exhibit significant bias against non-native English authors, as demonstrated by their high misclassification of TOEFL essays written by non-native speakers. In our study, we evaluated the performance of seven widely used GPT detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 US eighth-grade essays from the Hewlett Foundation’s ASAP dataset. While the detectors accurately classified the US student essays, they incorrectly labeled more than half of the TOEFL essays as “AI-generated” (average false-positive rate: 61.3%). — Read More
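A note on how that figure is computed: every essay in the TOEFL sample was written by a human, so any essay a detector flags as AI-generated is a false positive, and the per-detector false-positive rate reduces to the flagged count divided by the total. The flag count below is hypothetical, chosen only to show the arithmetic.

```python
# False-positive rate on an all-human essay set: every "AI-generated" flag
# is a false positive. The flagged count is hypothetical.
total_essays = 91      # human-written TOEFL essays in the study
flagged_as_ai = 56     # hypothetical count flagged by one detector
fpr = flagged_as_ai / total_essays
print(f"FPR = {fpr:.1%}")  # -> 61.5% for this hypothetical detector
```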
Gizmodo Editor Slams ‘Shameful’ AI-Written Article: ‘It’s F–king Dogs–t’
Gizmodo’s io9 section, which focuses on science fiction, published an error-riddled article written by “Gizmodo Bot,” which deputy editor James Whitbrook said on Wednesday was foisted on the site’s editorial team with little notice.
The AI-generated article, “A Chronological List of Star Wars Movies & TV Shows,” contained numerous factual errors. G/O Media, the owner of Gizmodo, said last week it was starting to use artificial intelligence on its sites, including Gizmodo, The Onion, Deadspin, and The Root. — Read More
No, GPT4 can’t ace MIT
A paper seemingly demonstrating that GPT-4 could ace the MIT EECS + Math curriculum recently went viral on Twitter, getting over 500 retweets in a single day. Like most, we were excited to read the analysis behind such a feat, but what we found left us surprised and disappointed. Even though the authors of the paper said they manually reviewed the published dataset for quality, we found clear signs that a significant portion of the evaluation dataset was contaminated in such a way that let the model cheat, like a student who was fed the answers to a test right before taking it.
We think this should call into greater question the recent flurry of academic work using large language models (LLMs) like GPT to shortcut data validation — a foundational principle in any kind of science, and especially in machine learning. These papers are often uploaded to arXiv and widely shared on Twitter before any legitimate peer review; in this case, that meant potentially spreading bad information and setting a poor precedent for future work. — Read More
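The kind of contamination described above is often detectable with simple checks. Below is a minimal, hypothetical sketch of one such check (flagging test questions that nearly duplicate the examples included in the model's prompt); it is not the method used by the critique's authors, and the example strings are made up.

```python
# Minimal leakage check (a hypothetical sketch, not the authors' method):
# flag test questions that closely overlap the few-shot examples shown to
# the model, one common way an eval set can let the model "cheat".
from difflib import SequenceMatcher

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True if two strings are nearly identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

few_shot_examples = ["What is the determinant of a 2x2 identity matrix?"]    # hypothetical
test_questions    = ["What is the determinant of the 2x2 identity matrix?"]  # hypothetical

for q in test_questions:
    if any(near_duplicate(q, ex) for ex in few_shot_examples):
        print(f"possible contamination: {q!r}")
```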
AI is Ushering In a New Scientific Revolution
Since the discovery of DNA’s double-helix structure in the 1950s, biologists have sought to tie lengths of genetic code to a range of cellular parts and processes—including, for example, the mRNA sequences encoding specific antigens that power the now-famous mRNA vaccines. Despite the progress in sequencing and understanding the genome since that discovery, one big missing link remained. Biologists lacked a way to accurately and efficiently predict the 3-D shape of an unknown protein using just its DNA or RNA source code. In biology, structure determines function. What a protein does in a cell depends on its shape. Cylindrical with a hollow middle makes for a good membrane channel, while U-shaped enzymes catalyze chemical reactions in their fjord-like cavities. Being able to predict or even design proteins would be a leap forward in our understanding of human disease and would unlock a range of new treatments.
But for more than 70 years, scientists have been stuck with slow methods that strained computers and relied largely on their own guesswork to tease out a protein’s structure. Despite knowing which stretches of DNA code for each of the amino acids that form the building blocks of every protein, biologists lacked a repeatable, generalizable formula to solve this so-called “protein-folding problem.” They needed a systematic understanding of how any string of amino acids, once linked, would fold into a 3-dimensional shape to unlock the vast universe of proteins.
In 2020, Google’s AI lab DeepMind announced that its algorithm, AlphaFold, had solved the protein-folding problem. At first, this stunning breakthrough was met with excitement from most scientists, who are always ready to test a new tool, and amusement from some. After all, wasn’t this the same company whose algorithm AlphaGo had defeated the world champion in the Chinese strategy game Go just a few years before? Mastering a game more complex than chess, difficult as that is, felt trivial compared to the protein-folding problem. But AlphaFold proved its scientific mettle by sweeping an annual competition in which teams of biologists predict the structure of proteins based only on their genetic code. The algorithm far outpaced its human rivals, posting scores that predicted the final shape to within an angstrom, roughly the width of a single atom. Soon after, AlphaFold passed its first real-world test by correctly predicting the shape of the SARS-CoV-2 ‘spike’ protein, the virus’s conspicuous surface protein that is targeted by vaccines. — Read More
VOS: Learning what you don’t know by virtual outlier synthesis
Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model’s decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside it, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves competitive performance on both object detection and image classification models, reducing the FPR95 by up to 9.36% compared to the previous best method on object detectors. — Read More
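As a rough illustration of the sampling step the abstract describes, the sketch below fits a class-conditional Gaussian to in-distribution features and keeps only low-likelihood samples as virtual outliers. It is a simplified stand-in, not the paper's implementation; the feature dimension, sample counts, and threshold are made up.

```python
# Simplified illustration of virtual outlier synthesis (not the paper's code):
# fit a Gaussian to one class's penultimate-layer features, sample candidates,
# and keep only those with low likelihood as "virtual outliers".
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))     # hypothetical ID features for one class

mu = features.mean(axis=0)
cov = np.cov(features, rowvar=False) + 1e-4 * np.eye(16)   # regularized covariance
cov_inv = np.linalg.inv(cov)

def log_likelihood(x):
    d = x - mu
    return -0.5 * np.einsum("ij,jk,ik->i", d, cov_inv, d)  # up to a constant

candidates = rng.multivariate_normal(mu, cov, size=2000)
scores = log_likelihood(candidates)
threshold = np.quantile(scores, 0.02)                 # keep the lowest-likelihood 2%
virtual_outliers = candidates[scores <= threshold]
print(virtual_outliers.shape)                         # ~40 synthetic boundary points
```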
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers – groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work. — Read More
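For readers unfamiliar with the setup being critiqued, the sketch below shows a generic pool-based active learning loop with a simple uncertainty acquisition function. It is a placeholder illustration, not one of the eight methods profiled in the paper, and the model and data are stand-ins.

```python
# Generic pool-based active learning loop (a sketch of the setup being
# critiqued, not one of the paper's methods). Model and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 10))                                    # unlabeled pool
y_pool = (X_pool[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)   # oracle labels

labeled = list(rng.choice(len(X_pool), size=20, replace=False))         # seed set
for _ in range(5):                                  # acquisition rounds
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)              # closest to 0.5 = most uncertain
    uncertainty[labeled] = -np.inf                  # never re-acquire labeled points
    labeled.extend(np.argsort(uncertainty)[-50:])   # label the 50 most uncertain
print(f"labeled {len(labeled)} of {len(X_pool)} examples")
```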