Let me finish your sentences

We’re all stochastic parrots (or) what AI can teach us about being human.

… It turns out that a machine that can finish our sentences can, with very minor modifications, also be made to write essays and stories, to summarize and translate. It can write working code and stylized poetry, generate art in the style of the old masters, and pass the SAT, GRE, LSAT, AP, and Bar exams. It can answer philosophical questions, act as a co-pilot, tutor, and therapist, do your child’s homework, and much more. 

The emergence of such new and general capabilities wasn’t obvious or necessarily a given. Almost no one, not even the creators of ChatGPT, fully anticipated its wide spectrum of cognitive and creative abilities. Despite Moravec’s Paradox, very few predicted that skills requiring human creativity would be among the first to fall to AI. — Read More

#nlp

MemGPT: Towards LLMs as Operating Systems

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable the use of context beyond the limited context window, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM’s limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM’s context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at this https URL. — Read More
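
The core idea is easy to picture: a fast tier (the tokens currently in the LLM’s context window) backed by a slow tier (external storage), with data paged between them as the conversation grows. Here is a minimal sketch of that tiering in Python; the class and method names are invented for illustration and are not MemGPT’s actual API.

```python
# Minimal sketch of hierarchical "virtual context" (illustrative only, not MemGPT's code).
# Fast tier: the tokens currently inside the LLM's context window.
# Slow tier: external archival storage that data gets paged out to and back in from.

class VirtualContext:
    def __init__(self, max_context_tokens=4096):
        self.max_context_tokens = max_context_tokens
        self.main_context = []        # fast tier: what the LLM actually sees
        self.archival_storage = []    # slow tier: unbounded external memory

    def _tokens(self, text):
        return len(text.split())      # crude token count, good enough for the sketch

    def _used(self):
        return sum(self._tokens(m) for m in self.main_context)

    def append(self, message):
        """Add a message, evicting the oldest ones to archival storage when over budget."""
        self.main_context.append(message)
        while self._used() > self.max_context_tokens and len(self.main_context) > 1:
            evicted = self.main_context.pop(0)       # page out of fast memory
            self.archival_storage.append(evicted)    # persist in slow memory

    def recall(self, query, k=3):
        """Page relevant archived messages back in (keyword match stands in for retrieval)."""
        hits = [m for m in self.archival_storage if query.lower() in m.lower()][:k]
        self.main_context = hits + self.main_context
        return hits


ctx = VirtualContext(max_context_tokens=50)
for i in range(20):
    ctx.append(f"user message {i}: some conversation text")
print(ctx.recall("message 3"))
```

In MemGPT itself, the paging decisions are made by the LLM through function calls, and interrupts hand control back and forth between the system and the user.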

#nlp

LLM generated Wikipedia-like articles

Welcome to the AI-generated encyclopaedia. You can click “Next interesting article” to start using the platform. Contact us if you have any feedback. — Read More

#nlp

Decomposing Language Models Into Understandable Components

Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don’t understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.

Neuroscientists face a similar problem with understanding the biological basis for human behavior. The neurons firing in a person’s brain must somehow implement their thoughts, feelings, and decision-making. Decades of neuroscience research have revealed a lot about how the brain works, and enabled targeted treatments for diseases such as epilepsy, but much remains mysterious. Luckily for those of us trying to understand artificial neural networks, experiments are much, much easier to run. We can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network’s response to any possible input.

Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. — Read More
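
For an artificial network, the “record every neuron, then silence one and see what changes” experiment really is a few lines of code. A minimal PyTorch sketch on a toy model (not one of the models studied in the post) looks like this:

```python
import torch
import torch.nn as nn

# Toy network standing in for one layer of a language model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

activations = {}

def record(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().clone()   # record every neuron's activation
    return hook

model[1].register_forward_hook(record("relu"))

x = torch.randn(1, 8)
baseline = model(x)

# "Silence" hidden neuron 3 and measure the effect on the network's output.
def ablate(module, inputs, output):
    output = output.clone()
    output[:, 3] = 0.0
    return output

handle = model[1].register_forward_hook(ablate)
ablated = model(x)
handle.remove()

print("activation of neuron 3:", activations["relu"][0, 3].item())
print("output change from silencing it:", (ablated - baseline).abs().max().item())
```

Running this kind of intervention at scale is exactly what surfaces the problem above: knocking out a single neuron rarely corresponds to a single, nameable behavior.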

#nlp

Finding unique features inside LLMs – Interpretability research by Anthropic

The neural networks inside large language models exhibit superposition: the model represents more distinct features than it has neurons, so individual neurons end up polysemantic, each responding to several unrelated features. This compression of many rare features of language is good for performance but makes the network harder to understand, because the features cannot be read off from individual neurons. Anthropic’s new paper tries to extract these hidden features in a human-interpretable form. — Read More
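
The extraction method is dictionary learning with a sparse autoencoder trained on a layer’s activations: project the activations into a much larger set of sparsely active directions, then reconstruct them. The sketch below shows that idea in miniature; the dimensions, sparsity penalty, and training loop are placeholders, not the paper’s setup.

```python
import torch
import torch.nn as nn

# Sketch of dictionary learning via a sparse autoencoder (illustrative, not Anthropic's exact setup).
d_model, n_features = 64, 512          # overcomplete: many more features than neurons

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)
        return features, recon

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(1024, d_model)                   # stand-in for recorded MLP activations

for step in range(200):
    features, recon = sae(acts)
    # Reconstruction loss plus an L1 penalty that pushes feature activations toward sparsity.
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each decoder column is a candidate interpretable direction in neuron space.
print(features[0].detach().topk(5))
```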

#nlp

Effective Long-Context Scaling of Foundation Models

We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series is built through continual pretraining from LLAMA 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over LLAMA 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k’s overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis of the individual components of our method. We delve into LLAMA’s position encodings and discuss their limitations in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths. Our ablation experiments suggest that having abundant long texts in the pretraining dataset is not the key to achieving strong performance, and we empirically verify that long-context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences. — Read More
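
On the position-encoding point: a common long-context adjustment in this line of work is to raise the base frequency of rotary position embeddings (RoPE) so that distant tokens are not rotated out of alignment as quickly. The sketch below shows RoPE with a configurable base; the shapes and the specific base value are illustrative assumptions, not the paper’s exact recipe.

```python
import torch

def rope_frequencies(head_dim, max_pos, base=10_000.0):
    """Rotary position embedding angles; raising `base` slows the rotation of
    low-frequency dimensions, one common way to stretch usable context length."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    angles = torch.outer(positions, inv_freq)          # (max_pos, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (seq_len, head_dim); rotate pairs of dimensions by position-dependent angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

seq_len, head_dim = 32_768, 128
q = torch.randn(seq_len, head_dim)

# base=10_000 is the standard setting; a larger base (e.g. 500_000) is a common
# long-context adjustment. The exact value here is an assumption for illustration.
cos, sin = rope_frequencies(head_dim, seq_len, base=500_000.0)
print(apply_rope(q, cos, sin).shape)
```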

#nlp

Contrastive Decoding Improves Reasoning in Large Language Models

We demonstrate that Contrastive Decoding — a simple, computationally light, and training-free text generation method proposed by Li et al. (2022) — achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general-purpose method for generating text from language models. — Read More
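
The decoding rule itself is simple enough to sketch: restrict to tokens the strong (“expert”) model finds plausible, then prefer tokens whose expert log-probability exceeds the weak (“amateur”) model’s by the largest margin. The snippet below is a simplified, single-step illustration; the hyperparameters and exact formulation are placeholders rather than the paper’s.

```python
import torch
import torch.nn.functional as F

def contrastive_decoding_step(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """One decoding step of contrastive decoding (sketch).

    Keeps only tokens the expert considers plausible (prob >= alpha * max expert prob),
    then scores each remaining token by expert log-prob minus beta * amateur log-prob.
    This is a simplification of the formulation in Li et al. (2022)."""
    expert_logprobs = F.log_softmax(expert_logits, dim=-1)
    amateur_logprobs = F.log_softmax(amateur_logits, dim=-1)

    # Plausibility constraint: mask out tokens the expert itself finds unlikely.
    cutoff = torch.log(torch.tensor(alpha)) + expert_logprobs.max(dim=-1, keepdim=True).values
    plausible = expert_logprobs >= cutoff

    scores = expert_logprobs - beta * amateur_logprobs
    scores = scores.masked_fill(~plausible, float("-inf"))
    return scores.argmax(dim=-1)

# Toy vocabulary of 10 tokens; in practice these logits come from a large and a small LLM.
expert = torch.randn(1, 10)
amateur = torch.randn(1, 10)
print(contrastive_decoding_step(expert, amateur))
```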

#nlp

15 times Faster than Llama 2: Introducing DeciLM – NAS-Generated LLM with Variable GQA

As the deep learning community continues to push the boundaries of Large Language Models (LLMs), the computational demands of these models have surged exponentially for both training and inference. This escalation has not only led to increased costs and energy consumption but also introduced barriers to their deployment and scalability. Achieving a balance between model performance, computational efficiency, and latency has thus become a focal point in recent LLM development.

Within this landscape, we are thrilled to introduce DeciLM 6B, a permissively licensed foundation LLM, and DeciLM 6B-Instruct, fine-tuned from DeciLM 6B for instruction-following use cases. With 5.7 billion parameters, DeciLM 6B delivers a throughput that’s 15 times higher than Llama 2 7B while maintaining comparable quality. Impressively, despite having significantly fewer parameters, DeciLM 6B and DeciLM 6B-Instruct consistently rank among the top-performing LLMs in the 7 billion parameter category across various LLM evaluation tasks. Our models thus establish a new benchmark for inference efficiency and speed. The hallmark of DeciLM 6B lies in its unique architecture, generated using AutoNAC, Deci’s cutting-edge Neural Architecture Search engine, to push the efficient frontier. Moreover, coupling DeciLM 6B with Deci’s inference SDK results in a substantial throughput enhancement. — Read More
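
The “variable GQA” in the title refers to grouped-query attention, in which several query heads share one key/value head; DeciLM varies the number of key/value groups from layer to layer rather than fixing a single ratio. Below is a minimal sketch of grouped-query attention with a per-call group count; the dimensions and random weights are illustrative, not DeciLM’s architecture.

```python
import torch

def grouped_query_attention(x, n_q_heads=8, n_kv_heads=2, head_dim=16):
    """Grouped-query attention sketch: n_q_heads query heads share n_kv_heads key/value
    heads. Varying n_kv_heads per layer is the "variable GQA" idea; weights are random."""
    batch, seq, d = x.shape
    wq = torch.randn(d, n_q_heads * head_dim)
    wk = torch.randn(d, n_kv_heads * head_dim)
    wv = torch.randn(d, n_kv_heads * head_dim)

    q = (x @ wq).view(batch, seq, n_q_heads, head_dim)
    k = (x @ wk).view(batch, seq, n_kv_heads, head_dim)
    v = (x @ wv).view(batch, seq, n_kv_heads, head_dim)

    # Each group of query heads attends to the same shared key/value head,
    # shrinking the KV cache relative to full multi-head attention.
    group_size = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)

    attn = torch.einsum("bqhd,bkhd->bhqk", q, k) / head_dim ** 0.5
    out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v)
    return out.reshape(batch, seq, n_q_heads * head_dim)

x = torch.randn(1, 5, 64)
print(grouped_query_attention(x, n_q_heads=8, n_kv_heads=2).shape)   # layer A: 2 KV heads
print(grouped_query_attention(x, n_q_heads=8, n_kv_heads=4).shape)   # layer B: 4 KV heads
```

Fewer key/value heads means a smaller KV cache, which is where much of the inference-throughput gain comes from.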

#nlp

FLM-101B: An Open LLM and How to Train It with $100K Budget

Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks. Despite these successes, their development faces two main challenges: (i) high computational cost; and (ii) difficulty in conducting fair and objective evaluations. LLMs are prohibitively expensive, making it feasible for only a few major players to undertake their training, thereby constraining both research and application opportunities. This underscores the importance of cost-effective LLM training. In this paper, we utilize a growth strategy to significantly reduce LLM training cost. We demonstrate that an LLM with 101B parameters can be trained on 0.31T tokens with a $100K budget. We also adopt a systematic evaluation paradigm for the IQ evaluation of LLMs, complementing existing evaluations that focus more on knowledge-oriented abilities. We introduce a benchmark that evaluates important aspects of intelligence, including symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model FLM-101B, trained with a budget of $100K, achieves comparable performance to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially in the IQ benchmark evaluations with contexts unseen in training data. The checkpoint of FLM-101B will be open-sourced at this https URL. — Read More
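
The “growth strategy” means starting training on a smaller model and enlarging it part-way through, rather than paying for the full 101B model from step one. As a loose illustration of the general idea only (FLM-101B’s actual growth operators are more involved than this), here is a toy width-growth step that warm-starts new units from existing ones:

```python
import torch
import torch.nn as nn

def grow_linear_width(layer, new_out):
    """Widen a linear layer by initializing new output units from existing ones,
    so the enlarged layer starts from learned weights instead of from scratch.
    This is a toy illustration, not FLM-101B's growth operator."""
    old_out, d_in = layer.weight.shape
    grown = nn.Linear(d_in, new_out, bias=True)
    with torch.no_grad():
        idx = torch.arange(new_out) % old_out        # reuse existing rows for the new units
        grown.weight.copy_(layer.weight[idx])
        grown.bias.copy_(layer.bias[idx])
    return grown

small = nn.Linear(8, 4)      # train this cheaply first
large = grow_linear_width(small, 8)   # then grow and continue training

x = torch.randn(2, 8)
print(small(x).shape, large(x).shape)
```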

#nlp, #strategy

IBM rolls out new generative AI features and models

Fighting for relevance in the growing — and ultra-competitive — AI space, IBM this week introduced new generative AI models and capabilities across its recently launched Watsonx data science platform.

The new models, called the Granite series models, appear to be standard large language models (LLMs) along the lines of OpenAI’s GPT-4 and ChatGPT, capable of summarizing, analyzing and generating text. IBM provided very little in the way of details about Granite, making it impossible to compare the models to rival LLMs — including IBM’s own. But the company claims that it’ll reveal the data used to train the Granite series models, as well as the steps used to filter and process that data, ahead of the models’ availability in Q3 2023. — Read More

#nlp