Language models (LMs) are powerful tools for natural language processing, but small ones often struggle to produce coherent and fluent text. Models with around 125M parameters, such as GPT-Neo (small) or GPT-2 (small), can rarely generate coherent and consistent English text beyond a few words, even after extensive training. This raises the question of whether the ability to produce coherent English text only emerges at larger scales (hundreds of millions of parameters or more) and with more complex architectures (many layers of global attention).
In this work, we introduce TinyStories, a synthetic dataset of short stories that contain only words a typical 3- to 4-year-old usually understands, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories several paragraphs long that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities.
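To give a concrete sense of scale, here is a minimal sketch of how a sub-10M-parameter, one-block model of the kind described above could be instantiated with Hugging Face's GPT-Neo classes; the specific dimensions are illustrative assumptions, not the paper's released configurations.

```python
# Illustrative sketch only: a one-block GPT-Neo-style model under 10M parameters.
# The sizes here are assumptions for scale, not the paper's exact configurations.
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=50257,             # standard GPT-2/GPT-Neo BPE vocabulary
    max_position_embeddings=512,  # short stories need only a small context window
    hidden_size=64,
    num_layers=1,                 # a single transformer block
    num_heads=8,
    attention_types=[[["global"], 1]],  # global attention in the single layer
)
model = GPTNeoForCausalLM(config)
print(f"{model.num_parameters():,} parameters")  # ~3M, dominated by the embedding table
```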
We also introduce a new paradigm for the evaluation of language models: we suggest a framework that uses GPT-4 to grade the content generated by these models as if it were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, with separate grades for capabilities such as grammar, creativity, and consistency.
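As a rough illustration of that grading setup, the sketch below asks GPT-4 to score a story along those dimensions; the prompt wording, the 1-to-10 scale, and the `grade_story` helper are assumptions for illustration, not the paper's exact protocol (it uses the mid-2023 `openai` Python API).

```python
# Hypothetical sketch of the GPT-4-as-teacher grading idea (not the paper's exact prompt).
import json

import openai  # assumes openai<1.0 (the mid-2023 API) and openai.api_key already set

GRADING_PROMPT = """You are a teacher grading a short story written by a student.
Score the story from 1 to 10 on each of: grammar, creativity, consistency.
Reply with JSON only, e.g. {{"grammar": 8, "creativity": 6, "consistency": 7}}.

Story:
{story}"""

def grade_story(story: str) -> dict:
    """Ask GPT-4 for a multidimensional grade of one generated story."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADING_PROMPT.format(story=story)}],
        temperature=0,  # keep the grading as deterministic as possible
    )
    # GPT-4 may occasionally wrap the JSON in prose; real code should parse defensively.
    return json.loads(response.choices[0].message.content)

# Usage: average grade_story(...) scores over many sampled stories per model.
```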
We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs. — Read More
The First Year of AI College Ends in Ruin
There’s an arms race on campus, and professors are losing.
One hundred percent AI. That's what the software concluded about a student's paper. One of the professors in the academic program I direct had come across this finding and asked me what to do with it. Then another one saw the same result (100 percent AI) for a different paper by that student, and also wondered: What does this mean? I did not know. I still don't.
The problem breaks down into more problems: whether it’s possible to know for certain that a student used AI, what it even means to “use” AI for writing papers, and when that use amounts to cheating. The software that had flagged our student’s papers was also multilayered: Canvas, our courseware system, was running Turnitin, a popular plagiarism-detection service, which had recently installed a new AI-detection algorithm. The alleged evidence of cheating had emerged from a nesting doll of ed-tech black boxes. Read More
Apple’s new ‘Personal Voice’ feature can create a voice that sounds like you or a loved one in just 15 minutes
As part of its preview of iOS 17 accessibility updates coming this year, Apple has announced a pair of new features called Live Speech and Personal Voice. Live Speech allows users to type what they want to say and have it spoken aloud.
Personal Voice, on the other hand, is a way to create and save a voice that sounds like you. Apple says it's designed for people at risk of losing their ability to speak, such as those with a recent diagnosis of ALS. Read More
