In October, New York City announced a plan to harness the power of artificial intelligence to improve the business of government. The announcement included a surprising centerpiece: an AI-powered chatbot that would provide New Yorkers with information on starting and operating a business in the city.
The problem, however, is that the city’s chatbot is telling businesses to break the law.
Five months after launch, it’s clear that while the bot appears authoritative, the information it provides on housing policy, worker rights, and rules for entrepreneurs is often incomplete and in worst-case scenarios “dangerously inaccurate,” as one local housing policy expert told The Markup. — Read More
Tag Archives: Accuracy
Who’s To Say that the Founding Fathers Were Even Human? Don’t Blame Gemini….
If you’re reading this article, you are presumably aware that Google has turned off the ability of its AI platform, Gemini, to create images of people.
In a bid to de-bias image results in favor of under-represented groups, Gemini struggled to produce images of white men. This led to users being presented with dark-skinned versions of the Founding Fathers of America, Vikings, Nazis, and Popes.
It has now come to light that Meta’s AI also “creates ahistorical images” [as seen here]. — Read More
Adobe Firefly repeats the same AI blunders as Google Gemini
Firefly, Adobe’s AI image creation tool, repeats some of the same controversial mistakes that Google’s Gemini made in inaccurate racial and ethnic depictions, illustrating the challenges tech companies face across the industry.
Google shut down its Gemini image creation tool last month after critics pointed out that it was creating historically inaccurate images, depicting America’s Founding Fathers as Black, for instance, and refusing to depict white people. CEO Sundar Pichai told employees the company “got it wrong.”
The tests done by Semafor on Firefly replicated many of the same things that tripped up Gemini. The two services rely on similar techniques for creating images from written text, but they are trained on very different datasets. Adobe uses only stock images or images that it licenses. — Read More
Google pauses Gemini’s ability to generate people after overcorrecting for diversity in historical images
Google said Thursday it’s pausing its Gemini chatbot’s ability to generate people. The move comes after viral social posts showed the AI tool overcorrecting for diversity, producing “historical” images of Nazis, America’s Founding Fathers and the Pope as people of color.
The X user @JohnLu0x posted screenshots of Gemini’s results for the prompt, “Generate an image of a 1943 German Solidier.” (Their misspelling of “Soldier” was intentional to trick the AI into bypassing its content filters to generate otherwise blocked Nazi images.) The generated results appear to show Black, Asian and Indigenous soldiers wearing Nazi uniforms.
Other social users criticized Gemini for producing images for the prompt, “Generate a glamour shot of a [ethnicity] couple.” It successfully spit out images when using “Chinese,” “Jewish” or “South African” prompts but refused to produce results for “white.” “I cannot fulfill your request due to the potential for perpetuating harmful stereotypes and biases associated with specific ethnicities or skin tones,” Gemini responded to the latter request. — Read More
The Guide To LLM Evals: How To Build and Benchmark Your Evals
How to build and run LLM evals — and why you should use precision and recall when benchmarking your LLM prompt template
Large language models (LLMs) are an incredible tool for developers and business leaders to create new value for consumers. They make personal recommendations, translate between unstructured and structured data, summarize large amounts of information, and do so much more.
As the applications multiply, so does the importance of measuring the performance of LLM-based applications. This is a nontrivial problem for several reasons: user feedback or any other “source of truth” is extremely limited and often nonexistent; even when possible, human labeling is still expensive; and it is easy to make these applications complex.
This complexity is often hidden by the abstraction layers of code and only becomes apparent when things go wrong. One line of code can initiate a cascade of calls (spans). Different evaluations are required for each span, thus multiplying your problems. For example, the simple code snippet below triggers multiple sub-LLM calls. — Read More
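The snippet that excerpt refers to is not reproduced here. As a separate illustration of the subtitle's point about precision and recall, the following is a minimal Python sketch of how an LLM eval's labels might be benchmarked against a small human-labeled set. The function name `precision_recall`, the label strings, and the toy data are all hypothetical and do not come from the original guide.

```python
def precision_recall(human_labels, eval_labels, positive="relevant"):
    """Compare an eval's labels with human ground truth for one positive class."""
    if len(human_labels) != len(eval_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, eval_labels))
    # True positive: eval says positive and the human agrees.
    tp = sum(1 for h, e in pairs if e == positive and h == positive)
    # False positive: eval says positive but the human disagrees.
    fp = sum(1 for h, e in pairs if e == positive and h != positive)
    # False negative: eval misses something the human marked positive.
    fn = sum(1 for h, e in pairs if e != positive and h == positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy, made-up labels purely for illustration.
human = ["relevant", "relevant", "irrelevant", "irrelevant", "relevant"]
llm_eval = ["relevant", "irrelevant", "relevant", "irrelevant", "relevant"]

precision, recall = precision_recall(human, llm_eval)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting the two numbers separately shows whether an eval tends to over-flag (low precision) or under-flag (low recall), which a single accuracy figure can hide when the classes are imbalanced.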
Evaluating LLMs is a minefield
How Google taught AI to doubt itself
… From the day that the chatbots arrived last year, their makers warned us not to trust them. The text generated by tools like ChatGPT does not draw on a database of established facts. Instead, chatbots are predictive — making probabilistic guesses about which words seem right based on the massive corpus of text that their underlying large language models were trained on.
As a result, chatbots are often “confidently wrong,” to use the industry’s term. And this can fool even highly educated people, as we saw this year with the case of the lawyer who submitted citations generated by ChatGPT — not realizing that every single case had been fabricated out of whole cloth. — Read More
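To make the idea of "probabilistic guesses about which words seem right" concrete, here is a deliberately tiny Python sketch of a bigram model, which estimates the next word only from how often it followed the previous word in its training text. It is an assumption-laden toy, far simpler than a real large language model, and the corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words followed it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Turn the follow-up counts for `word` into a probability distribution."""
    following = counts.get(word, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in following.items()}


# Made-up training text; the model has no notion of whether a case exists.
corpus = "the court cited the case and the court dismissed the case"
model = train_bigrams(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

The model ranks continuations by how often they appeared, not by whether they are true, which is one way to picture how fluent but fabricated text can arise.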
OpenAI scuttles AI-written text detector over ‘low rate of accuracy’
OpenAI has shut down its AI classifier, a tool that claimed to determine the likelihood a text passage was written by another AI. While many used and perhaps unwisely relied on it to catch low-effort cheats, OpenAI has retired it over its widely criticized “low rate of accuracy.”
The theory that AI-generated text has some identifying feature or pattern that can be detected reliably seems intuitive, but so far this has not really been borne out in practice. Although some generated text may have an obvious tell, the differences between large language models and the rapidity with which they have developed have made those tells all but impossible to rely on. — Read More
GPT detectors are biased against non-native English writers
GPT detectors frequently misclassify non-native English writing as AI generated, raising concerns about fairness and robustness. Addressing the biases in these detectors is crucial to prevent the marginalization of non-native English speakers in evaluative and educational settings and to create a more equitable digital landscape.
… GPT detectors exhibit significant bias against non-native English authors, as demonstrated by their high misclassification of TOEFL essays written by non-native speakers. In our study, we evaluated the performance of seven widely used GPT detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 US eighth-grade essays from the Hewlett Foundation’s ASAP dataset. While the detectors accurately classified the US student essays, they incorrectly labeled more than half of the TOEFL essays as “AI-generated” (average false-positive rate: 61.3%). — Read More
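The false-positive rate quoted above is the share of human-written essays that a detector wrongly flags as AI-generated. A minimal Python sketch of that calculation follows. The function name and the toy data are hypothetical and are not the study's data or code.

```python
def false_positive_rate(is_ai_written, flagged_as_ai):
    """Fraction of human-written texts that the detector flagged as AI."""
    if len(is_ai_written) != len(flagged_as_ai):
        raise ValueError("inputs must have the same length")

    # Keep only the detector's verdicts on texts that humans actually wrote.
    human_flags = [
        flag for truth, flag in zip(is_ai_written, flagged_as_ai) if not truth
    ]
    if not human_flags:
        raise ValueError("no human-written texts to evaluate")
    return sum(human_flags) / len(human_flags)


# Toy, made-up example: five human-written essays and two AI-written ones.
is_ai_written = [False, False, False, False, False, True, True]
flagged_as_ai = [True, True, True, False, False, True, False]

rate = false_positive_rate(is_ai_written, flagged_as_ai)
print(f"false-positive rate: {rate:.1%}")
```

Computing this rate separately for each group of writers, as the study did for TOEFL and US eighth-grade essays, is what exposes a disparity that an overall accuracy figure would average away.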
Gizmodo Editor Slams ‘Shameful’ AI-Written Article: ‘It’s F–king Dogs–t’
Gizmodo’s io9 section, which focuses on science fiction, published an error-riddled article written by “Gizmodo Bot,” which deputy editor James Whitbrook said on Wednesday was foisted on the site’s editorial team with little notice.
The AI-generated article, “A Chronological List of Star Wars Movies & TV Shows,” was riddled with factual errors. G/O Media, the owner of Gizmodo, said last week it was starting to use artificial intelligence on its sites, including Gizmodo, The Onion, Deadspin and The Root. — Read More