What can LLMs never do?

Every time over the past few years that we came up with problems LLMs can’t do, they passed them with flying colours. But even as they passed them with flying colours, they still can’t answer questions that seem simple, and it’s unclear why.

And so, over the past few weeks I have been obsessed by trying to figure out the failure modes of LLMs. This started off as an exploration of what I found. It is admittedly a little wonky but I think it is interesting. The failures of AI can teach us a lot more about what it can do than the successes. — Read More

#nlp

The Rise of Large-Language-Model Optimization

The web has become so interwoven with everyday life that it is easy to forget what an extraordinary accomplishment and treasure it is. In just a few decades, much of human knowledge has been collectively written up and made available to anyone with an internet connection.

But all of this is coming to an end. The advent of AI threatens to destroy the complex online ecosystem that allows writers, artists, and other creators to reach human audiences. — Read More

#strategy

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs — we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at this https URL. — Read More

#performance

Evolutionary Optimization of Model Merging Recipes

We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development. — Read More

#performance

There’s An AI For That (TAAFT)

“There’s An AI For That” is a leading AI aggregator offering a database of over 12400 AIs available for over 15000 tasks. The platform provides remarkable inventory of cutting-edge AI Solutions for almost every need. — Read More

#devops

USAF Test Pilot School and DARPA announce breakthrough in aerospace machine learning

The U.S. Air Force Test Pilot School and the Defense Advanced Research Projects Agency were finalists for the 2023 Robert J. Collier Trophy, a formal acknowledgement of recent breakthroughs that have launched the machine-learning era within the aerospace industry.

The teams worked together to test breakthrough executions in artificial intelligence algorithms using the X-62A VISTA aircraft as part of DARPA’s Air Combat Evolution (ACE) program.

… In less than a calendar year the teams went from the initial installation of live AI agents into the X-62A’s systems, to demonstrating the first AI versus human within-visual-range engagements, otherwise known as a dogfight. In total, the team made over 100,000 lines of flight-critical software changes across 21 test flights. — Read More

#dod

Reid Hoffman meets his AI twin

Read More

#videos

An AI startup made a hyperrealistic deepfake of me that’s so good it’s scary

I’m stressed and running late, because what do you wear for the rest of eternity? 

This makes it sound like I’m dying, but it’s the opposite. I am, in a way, about to live forever, thanks to the AI video startup Synthesia. For the past several years, the company has produced AI-generated avatars, but today it launches a new generation, its first to take advantage of the latest advancements in generative AI, and they are more realistic and expressive than anything I’ve ever seen. While today’s release means almost anyone will now be able to make a digital double, on this early April afternoon, before the technology goes public, they’ve agreed to make one of me. — Read More

#fake

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring 2× fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code along with pre-trained model weights and training recipes is available at \url{this https URL}. Additionally, \model models can be found on HuggingFace at: \url{this https URL}. — Read More

#devops, #nlp

Evaluating language models on a wide range ofopen source legal reasoning tasks

There has been a considerable effort to measure language model performance in academic tasks and chatbot settings but these high-level benchmarks are not applicable to specific industry use cases. Here we start to remedy this by reporting our application-specific findings and live leaderboard results on LegalBench, a large crowd-sourced collection of legal reasoning tasks.  — Read More

#legal