How open-source LLMs are challenging OpenAI, Google, and Microsoft

In the past few years, it seemed that wealthy tech companies would be able to monopolize the growing market for large language models (LLMs). And recent earnings calls from big tech companies suggested they are in control. Microsoft’s announcements, in particular, show that the company has created a billion-dollar business from its AI services, including through Azure OpenAI Services and the workloads OpenAI runs on its cloud infrastructure.

However, a recently leaked internal document from Google indicates that the market share of big tech is not as secure as it seems thanks to advances in open-source LLMs. In short, the document says “We have no moat, and neither does OpenAI.” The dynamics of the market are gradually shifting from “bigger is better” to “cheaper is better,” “more efficient is better,” and “customizable is better.” And while there will always be a market for cloud-based LLM and generative AI products, customers now have open-source options to explore as well. — Read More

#devops, #nlp

The Falcon has landed in the Hugging Face ecosystem

Falcon is a new family of state-of-the-art language models created by the Technology Innovation Institute in Abu Dhabi, and released under the Apache 2.0 license. Notably, Falcon-40B is the first “truly open” model with capabilities rivaling many current closed-source models. This is fantastic news for practitioners, enthusiasts, and industry, as it opens the door for many exciting use cases.

In this blog, we will be taking a deep dive into the Falcon models: first discussing what makes them unique and then showcasing how easy it is to build on top of them (inference, quantization, finetuning, and more) with tools from the Hugging Face ecosystem. — Read More

#devops, #nlp

Open-Source LLMs

In February, Meta released its large language model: LLaMA. Unlike OpenAI and its ChatGPT, Meta didn’t just give the world a chat window to play with. Instead, it released the code into the open-source community, and shortly thereafter the model itself was leaked. Researchers and programmers immediately started modifying it, improving it, and getting it to do things no one else anticipated. And their results have been immediate, innovative, and an indication of how the future of this technology is going to play out. Training speeds have hugely increased, and the size of the models themselves has shrunk to the point that you can create and run them on a laptop. The world of AI research has dramatically changed.

This development hasn’t made the same splash as other corporate announcements, but its effects will be much greater. It will wrest power from the large tech corporations, resulting in both much more innovation and a much more challenging regulatory landscape. The large corporations that had controlled these models warn that this free-for-all will lead to potentially dangerous developments, and problematic uses of the open technology have already been documented. But those who are working on the open models counter that a more democratic research environment is better than having this powerful technology controlled by a small number of corporations. — Read More

#devops, #nlp

A PhD Student’s Perspective on Research in NLP in the Era of Very Large Language Models

Recent progress in large language models has enabled the deployment of many generative NLP applications. At the same time, it has also led to a misleading public discourse that “it’s all been solved.” Not surprisingly, this has in turn made many NLP researchers — especially those at the beginning of their career — wonder about what NLP research area they should focus on. This document is a compilation of NLP research directions that are rich for exploration, reflecting the views of a diverse group of PhD students in an academic research lab. While we identify many research areas, many others exist; we do not cover those areas that are currently addressed by LLMs but where LLMs lag behind in performance, or those focused on LLM development. — Read More

#nlp

MEGABYTE, Meta AI’s New Revolutionary Model Architecture, Explained

Unlocking the true potential of content generation in natural language processing (NLP) has always been a challenge. Traditional models struggle with long sequences, scalability, and sluggish generation speed. 

But fear not, as Meta AI brings forth MEGABYTE – a groundbreaking model architecture that revolutionizes content generation. In this blog, we will dive deep into the secrets behind MEGABYTE’s potential, its innovative features, and how it tackles the limitations of current approaches. — Read More

#big7, #chatbots, #nlp

Preserving the World’s Language Diversity Through AI

Many of the world’s languages are in danger of disappearing, and the limitations of current speech recognition and generation technology will only accelerate this trend. We want to make it easier for people to access information and use devices in their preferred language, and today we’re announcing a series of artificial intelligence (AI) models that could help them do just that.

Massively Multilingual Speech (MMS) models expand text-to-speech and speech-to-text technology from around 100 languages to more than 1,100 — more than 10 times as many as before — and can also identify more than 4,000 spoken languages, 40 times more than before.

… We’re open-sourcing our models and code so that others in the research community can build on our work and help preserve the world’s languages and bring the world closer together. — Read More

#big7, #nlp

OpenAI bets $30M on this GPT-powered education app | Speak

Read More

#nlp, #videos

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention).

In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model’s output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency.

We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs. — Read More
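
As a rough illustration of the teacher-style grading idea in the excerpt above, here is a minimal sketch of the bookkeeping around such an evaluation: building a grading prompt and turning the grader's reply into per-dimension scores. The prompt wording, the 1 to 10 scale, the reply format, and the function names are all assumptions made for illustration; they are not taken from the TinyStories paper, and the call to the grading model itself is deliberately left out.

```python
import re
from statistics import mean

# Dimensions named in the excerpt; the 1-10 scale below is an assumption.
DIMENSIONS = ("grammar", "creativity", "consistency")


def build_grading_prompt(story: str) -> str:
    """Build a (hypothetical) prompt asking a grader model to mark a story
    the way a teacher would mark a student's work."""
    rubric = ", ".join(DIMENSIONS)
    return (
        "The following story was written by a student. "
        f"Grade it from 1 to 10 on each of: {rubric}. "
        "Answer with one 'name: score' line per dimension.\n\n"
        f"Story:\n{story}"
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract 'name: score' lines from the grader's reply.

    Lines that do not name a known dimension are ignored; a reply that
    omits a dimension raises ValueError.
    """
    scores = {}
    for name, value in re.findall(r"(?im)^\s*(\w+)\s*:\s*(\d+)", reply):
        key = name.lower()
        if key in DIMENSIONS:
            scores[key] = int(value)
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"grader reply is missing: {missing}")
    return scores


def average_scores(all_scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension over many graded stories."""
    return {d: mean(s[d] for s in all_scores) for d in DIMENSIONS}
```

In use, each generated story would be passed through `build_grading_prompt`, sent to whichever grading model is available, and the reply handed to `parse_scores`; `average_scores` then gives one multidimensional score per model under test.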

#nlp

The First Year of AI College Ends in Ruin

There’s an arms race on campus, and professors are losing.

One-hundred percent AI. That’s what the software concluded about a student’s paper. One of the professors in the academic program I direct had come across this finding and asked me what to do with it. Then another one saw the same result—100 percent AI—for a different paper by that student, and also wondered: What does this mean? I did not know. I still don’t.

The problem breaks down into more problems: whether it’s possible to know for certain that a student used AI, what it even means to “use” AI for writing papers, and when that use amounts to cheating. The software that had flagged our student’s papers was also multilayered: Canvas, our courseware system, was running Turnitin, a popular plagiarism-detection service, which had recently installed a new AI-detection algorithm. The alleged evidence of cheating had emerged from a nesting doll of ed-tech black boxes. Read More

#fake, #nlp

DarkBERT: A Language Model for the Dark Side of the Internet

Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web. — Read More

#cyber, #nlp