The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. — Read More
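To make the idea concrete, here is a minimal sketch (my own illustration, not the authors' implementation) of what next-vector prediction could look like: a small autoencoder compresses each chunk of K tokens into one continuous vector, and an autoregressive model operates over those vectors instead of over tokens. All module names, sizes, and hyperparameters are made up for the example, and the likelihood-free training and sampling machinery that the paper relies on is not shown here.

```python
# Hypothetical sketch of the next-vector-prediction idea, not the CALM codebase.
import torch
import torch.nn as nn

K, VOCAB, D = 4, 1000, 256  # chunk size, vocabulary size, vector dimension (illustrative)

class ChunkAutoencoder(nn.Module):
    """Compress K tokens into a single continuous vector and reconstruct them."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.encode = nn.Linear(K * D, D)          # K token embeddings -> 1 vector
        self.decode = nn.Linear(D, K * VOCAB)      # 1 vector -> K token logits

    def forward(self, tokens):                     # tokens: (batch, K)
        z = self.encode(self.embed(tokens).flatten(1))
        logits = self.decode(z).view(-1, K, VOCAB)
        return z, logits

class VectorLM(nn.Module):
    """Autoregressive model over chunk vectors instead of discrete tokens."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vectors):                    # vectors: (batch, T, D)
        T = vectors.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        return self.backbone(vectors, mask=mask)   # position t predicts vector t+1

# Toy usage: one generative step now covers K tokens instead of one.
tokens = torch.randint(0, VOCAB, (2, 3 * K))       # 12 tokens = 3 chunks per sequence
ae, lm = ChunkAutoencoder(), VectorLM()
z, _ = ae(tokens.view(-1, K))                      # (6, D) chunk vectors
pred = lm(z.view(2, 3, D))                         # predicted next-chunk vectors
```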
Tag Archives: Deep Learning
KAN: Kolmogorov-Arnold Networks
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes (“neurons”), KANs have learnable activation functions on edges (“weights”). KANs have no linear weights at all — every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives to MLPs, opening opportunities for further improving today’s deep learning models, which rely heavily on MLPs. — Read More
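As a rough illustration (not the authors' code), here is what a KAN-style layer could look like when every edge carries its own learnable univariate function. The paper parametrizes these functions as B-splines; the sketch below substitutes a simpler learned combination of fixed radial basis functions, and all names and hyperparameters are invented for the example.

```python
# Hypothetical KAN-style layer: one learnable univariate function per edge,
# approximated here with radial basis functions instead of the paper's B-splines.
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        # Fixed RBF centers spread over the expected input range.
        self.register_buffer('centers', torch.linspace(-2, 2, n_basis))
        # One coefficient vector per edge: shape (out_dim, in_dim, n_basis).
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, n_basis) * 0.1)

    def forward(self, x):                                   # x: (batch, in_dim)
        # Evaluate the basis at every input: (batch, in_dim, n_basis)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # Sum each edge function phi_{j,i}(x_i) over incoming edges i to node j.
        return torch.einsum('bif,oif->bo', basis, self.coef)

# Toy usage: fit y = sin(pi * x) with a two-layer KAN (no linear weight matrices).
model = nn.Sequential(KANLayer(1, 8), KANLayer(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.rand(256, 1) * 2 - 1
y = torch.sin(torch.pi * x)
for _ in range(500):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```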
The Universal Approximation Theorem
Artificial Intelligence has become very present in the media in the last couple of years. At the end of 2022, ChatGPT captured the world’s attention, showing at least a hundred million users around the globe the extraordinary potential of large language models. Large language models such as LLaMA, Bard and ChatGPT mimic intelligent behavior equivalent to, or indistinguishable from, that of a human in specific areas (i.e., the Imitation Game, or Turing Test). Stephen Wolfram has written an article about how ChatGPT works.
Year-end 2022 might therefore be a watershed moment for humankind, since Artificial Intelligence now has the potential to change the way humans think and work.
… All these achievements have one thing in common: they are built on models using Artificial Neural Networks (ANNs). … ANNs are very good function approximators, provided that big data from the corresponding domain is available. Almost all practical problems, such as playing a game of Go or mimicking intelligent behavior, can be represented by mathematical functions.
The corresponding theorem that formulates this basic idea of approximation is called the Universal Approximation Theorem. It is a fundamental result in the field of ANNs, stating that certain types of neural networks can approximate certain classes of functions to any desired degree of accuracy. This theorem suggests that a neural network is capable of learning complex patterns and relationships in data, as long as certain conditions are fulfilled.
The Universal Approximation Theorem is the root cause of why ANNs are so successful and capable of solving a wide range of problems in machine learning and other fields. — Read More
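For reference, one common textbook formulation of the theorem (stated informally, following Cybenko 1989 and later generalizations such as Leshno et al. 1993 for non-polynomial activations) is the following:

```latex
% A single hidden layer with a continuous, non-polynomial activation \sigma is
% dense in the continuous functions on a compact domain (informal statement).
\[
  \forall f \in C([0,1]^d),\ \forall \varepsilon > 0\
  \exists N \in \mathbb{N},\ a_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^d :
  \quad
  \sup_{x \in [0,1]^d}
  \Bigl| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \Bigr|
  < \varepsilon .
\]
```

In words: for any target accuracy, there is some (possibly very wide) one-hidden-layer network whose output stays within that accuracy of the target function everywhere on the domain; the theorem guarantees existence, not that training will find those weights.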
Deep Learning Is Hitting a Wall
What would it take for artificial intelligence to make real progress?
“Let me start by saying a few things that seem obvious,” Geoffrey Hinton, “Godfather” of deep learning and one of the most celebrated scientists of our time, told a leading AI conference in Toronto in 2016. “If you work as a radiologist you’re like the coyote that’s already over the edge of the cliff but hasn’t looked down.” Deep learning is so well-suited to reading images from MRIs and CT scans, he reasoned, that people should “stop training radiologists now” and that it’s “just completely obvious within five years deep learning is going to do better.”
Fast forward to 2022, and not a single radiologist has been replaced. Rather, the consensus view nowadays is that machine learning for radiology is harder than it looks; at least for now, humans and machines complement each other’s strengths. Read More
Meta AI’s shocking insight about Big Data and Deep Learning
Thanks to the amazing success of AI, we’ve seen more and more organizations implement Machine Learning into their pipelines. As the access to and collection of data increases, we have seen massive datasets being used to train giant deep learning models that reach superhuman performance. This has led to a lot of hype around domains like Data Science and Big Data, fueled even more by the recent boom in Large Language Models.
Big Tech companies (and Deep Learning Experts on Twitter/YouTube) have really fallen in love with the ‘add more data, increase model size, train for months’ approach that has become the status quo in Machine Learning these days. However, heretics from Meta AI published research that was funded by Satan, and it turns out this way of doing things is extremely inefficient, and completely unnecessary. In this post, I will be going over their paper, Beyond neural scaling laws: beating power law scaling via data pruning, where they share ‘evidence’ about how selecting samples intelligently can increase your model’s performance without ballooning your costs out of control. While this paper focuses on Computer Vision, the principles of their research will be interesting to you regardless of your specialization. Read More
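To give a flavor of the mechanism (this is my own simplified sketch, not the authors' code): the paper's self-supervised pruning metric clusters example embeddings with k-means and scores each example by its distance to the nearest centroid; when data is abundant, keeping the hard, far-from-centroid examples and dropping the easy, prototypical ones works best. Function names and parameters below are illustrative.

```python
# Hypothetical sketch of difficulty-based data pruning via k-means distances.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_difficulty(embeddings: np.ndarray, keep_fraction: float, n_clusters: int = 100):
    """Return indices of the examples to keep after pruning."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    # Distance of each example to its assigned centroid acts as an "easiness" proxy:
    # small distance = prototypical/easy, large distance = hard.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(keep_fraction * len(embeddings))
    return np.argsort(dists)[-n_keep:]            # keep the hardest examples

# Toy usage with random "embeddings" standing in for features of a large dataset.
emb = np.random.randn(10_000, 64).astype(np.float32)
kept = prune_by_difficulty(emb, keep_fraction=0.7)
print(f"kept {len(kept)} of {len(emb)} examples")
```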
Deep Learning on Electronic Medical Records is doomed to fail
A few years ago, I worked on a project to investigate the potential of machine learning to transform healthcare through modeling electronic medical records. I walked away deeply disillusioned with the whole field, and I really don’t think it needs machine learning right now. What it does need is plenty of IT support. But even that’s not enough. Here are some of the structural reasons why I don’t think deep learning models on EMRs are going to be useful any time soon:
- Data is fragmented
- Data is Workflow, Workflow is Data. (with apologies to Lisp)
- Data reflects an adversarial process
- Data encodes clinical expertise
- Causal inference is hard
Deep Neural Nets: 33 years ago and 33 years from now
The Yann LeCun et al. (1989) paper Backpropagation Applied to Handwritten Zip Code Recognition is, I believe, of some historical significance because it is, to my knowledge, the earliest real-world application of a neural net trained end-to-end with backpropagation. Except for the tiny dataset (7291 16×16 grayscale images of digits) and the tiny neural network used (only 1,000 neurons), this paper reads remarkably modern today, 33 years later – it lays out a dataset, describes the neural net architecture, loss function, optimization, and reports the experimental classification error rates over training and test sets. It’s all very recognizable and type checks as a modern deep learning paper, except it is from 33 years ago. So I set out to reproduce the paper 1) for fun, but 2) to use the exercise as a case study on the nature of progress in deep learning. Read More
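For orientation, a rough modern approximation of the network described there (two small convolutional layers with tanh, 30 hidden units, MSE against one-hot targets, plain SGD) might look like the sketch below. This is not an exact reproduction: the original 1989 connection tables, weight counts, and training schedule are not reproduced, and the data here is a random stand-in for the 7291-image zip code set.

```python
# Hypothetical, simplified approximation of the 1989-style digit network.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),   # 16x16 -> 8x8
    nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),  # 8x8 -> 4x4
    nn.Flatten(),
    nn.Linear(12 * 4 * 4, 30), nn.Tanh(),
    nn.Linear(30, 10),
)

opt = torch.optim.SGD(net.parameters(), lr=0.03)
x = torch.randn(64, 1, 16, 16)                    # stand-in for 16x16 grayscale digits
y = torch.nn.functional.one_hot(torch.randint(0, 10, (64,)), 10).float()
loss = ((net(x) - y) ** 2).mean()                 # the 1989 paper used MSE, not cross-entropy
loss.backward()
opt.step()
```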
AI’s Smarts Now Come With a Big Price Tag
As language models get more complex, they also get more expensive to create and run. Some companies are locked out.
Calvin Qi, who works at a search startup called Glean, would love to use the latest artificial intelligence algorithms to improve his company’s products.
Glean provides tools for searching through applications like Gmail, Slack, and Salesforce. Qi says new AI techniques for parsing language would help Glean’s customers unearth the right file or conversation a lot faster.
But training such a cutting-edge AI algorithm costs several million dollars. So Glean uses smaller, less capable AI models that can’t extract as much meaning from text. Read More
Credit card PINs can be guessed even when covering the ATM pad
Researchers have proven it’s possible to train a special-purpose deep-learning algorithm that can guess 4-digit card PINs 41% of the time, even if the victim is covering the pad with their hands.
… By using three tries, which is typically the maximum allowed number of attempts before the card is withheld, the researchers reconstructed the correct sequence for 5-digit PINs 30% of the time, and reached 41% for 4-digit PINs. Read More