Scale AI publishes its first LLM Leaderboards, ranking AI model performance in specific domains

Artificial intelligence training data provider Scale AI Inc., which serves the likes of OpenAI and Nvidia Corp., today published the results of its first-ever SEAL Leaderboards.

It’s a new ranking system for frontier large language models that uses private, curated and unexploitable datasets to rate their capabilities in common use cases such as generative AI coding, instruction following, math and multilinguality.

The SEAL Leaderboards show OpenAI’s GPT family of LLMs ranking first in three of the four initial domains, with Anthropic PBC’s popular Claude 3 Opus taking first place in the fourth. Google LLC’s Gemini models also did well, tying for first with the GPT models in a couple of the domains. — Read More

#leaderboard

#strategy

Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices

In the ever-evolving landscape of Artificial Intelligence (AI), the development and deployment of Large Language Models (LLMs) have become pivotal in shaping intelligent applications across various domains. However, realizing this potential requires a rigorous and systematic evaluation process. Before delving into the metrics and challenges associated with evaluating LLM systems, let’s pause for a moment to consider the current approach to evaluation. Does your evaluation process resemble the repetitive loop of running LLM applications on a list of prompts, manually inspecting outputs, and attempting to gauge quality based on each input? If so, it’s time to recognize that evaluation is not a one-time endeavor but a multi-step, iterative process that has a significant impact on the performance and longevity of your LLM application.

With the rise of LLMOps (an extension of MLOps tailored for Large Language Models), the integration of CI/CE/CD (Continuous Integration/Continuous Evaluation/Continuous Deployment) has become indispensable for effectively overseeing the lifecycle of applications powered by LLMs. — Read More
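As a rough illustration of the shift from manual inspection to repeatable evaluation described above, here is a minimal sketch in Python. Everything in it is hypothetical: `run_llm_app` stands in for the application under test, the token-overlap metric is a deliberately simple placeholder for a task-specific metric or LLM judge, and the pass threshold is arbitrary.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    reference: str

def run_llm_app(prompt: str) -> str:
    # Stand-in for the LLM application under test: simply echoes the
    # prompt. Replace with a real model or chain invocation.
    return prompt

def score(output: str, reference: str) -> float:
    # Deliberately simple metric: fraction of reference tokens present
    # in the output. Real suites would swap in task-specific metrics
    # or an LLM-as-judge here.
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(out_tokens & ref_tokens) / max(len(ref_tokens), 1)

def evaluate(cases: list[EvalCase], threshold: float = 0.5) -> dict:
    # Run every case and aggregate, so the report can gate a pipeline
    # instead of relying on manual inspection of individual outputs.
    scores = [score(run_llm_app(c.prompt), c.reference) for c in cases]
    passed = sum(s >= threshold for s in scores)
    return {
        "mean_score": sum(scores) / len(scores),
        "passed": passed,
        "failed": len(cases) - passed,
    }

if __name__ == "__main__":
    suite = [EvalCase("What is 2 + 2?", "2 + 2 is 4")]
    print(evaluate(suite))
```

In a CI/CE/CD setting, a suite like this would run on every change, and the aggregate report rather than eyeballed individual outputs would decide whether a version ships.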

#mlops

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge. However, most existing methods retrieve only short contiguous chunks from a retrieval corpus, limiting holistic understanding of the overall document context. We introduce the novel approach of recursively embedding, clustering, and summarizing chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, our RAPTOR model retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. Controlled experiments show that retrieval with recursive summaries offers significant improvements over traditional retrieval-augmented LMs on several tasks. On question-answering tasks that involve complex, multi-step reasoning, we show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy. — Read More
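To make the bottom-up construction concrete, here is a short Python sketch of the embed, cluster, and summarize loop under stated assumptions: `embed` and `summarize` are toy stand-ins for an embedding model and an LLM summarizer, and plain k-means replaces the soft Gaussian-mixture clustering the paper actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(texts: list[str]) -> np.ndarray:
    # Toy stand-in: pseudo-random embeddings keyed on the text.
    # Replace with a real embedding model.
    rngs = [np.random.default_rng(abs(hash(t)) % 2**32) for t in texts]
    return np.stack([r.standard_normal(16) for r in rngs])

def summarize(texts: list[str]) -> str:
    # Toy stand-in: truncated concatenation. Replace with an LLM call
    # that produces an abstractive summary of the cluster.
    return " ".join(texts)[:200]

def build_raptor_tree(chunks: list[str], branching: int = 5) -> list[list[str]]:
    # Build the tree bottom-up: each layer clusters the nodes of the
    # layer below and replaces every cluster with one summary node,
    # until a single root summary remains.
    layers = [chunks]
    while len(layers[-1]) > 1:
        nodes = layers[-1]
        k = max(1, len(nodes) // branching)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embed(nodes))
        layers.append([
            summarize([n for n, lbl in zip(nodes, labels) if lbl == c])
            for c in range(k)
        ])
    return layers
```

At query time, retrieval can either traverse this tree or search all layers at once (the paper's "collapsed tree" variant), so fine-grained leaf chunks and high-level summaries compete in the same similarity search.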

#nlp