Genetic Algorithm Runs On Atari 800 XL

For the last few years, the accepted story in artificial intelligence was that all of the big names in the field needed more compute, more resources, more energy, and more money to build better models. But throwing money and GPUs at these companies without question left them complacent and ripe to be upset by an underdog with a fraction of the computing resources and funding. Perhaps that should have been more obvious from the start, since people have been building various machine learning algorithms on extremely limited computing platforms, like this one built on the Atari 800 XL.

Unlike models that rely on memory-intensive methods like gradient descent to train their neural networks, [Jean Michel Sellier] is using a genetic algorithm to work within the confines of the platform. Genetic algorithms evaluate potential solutions by evolving them over many generations, keeping the ones that perform best each time. The changes made to the survivors before the next generation can be introduced in many ways, but on a limited system like this a cheap approach is to make small random mutations. — Read More
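The write-up doesn't reproduce [Jean Michel Sellier]'s code, but the survive-and-mutate loop it describes is compact enough to sketch. Here is a minimal, purely illustrative Python version (the fitness function, population size, and mutation scale are made up for the example, not taken from the project):

```python
import random

# Toy fitness: how close a candidate vector is to a hidden target (illustrative only).
TARGET = [0.2, -0.5, 0.9, 0.1]

def fitness(candidate):
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def mutate(candidate, scale=0.05):
    # Small random changes: the cheap mutation strategy the article describes.
    return [c + random.uniform(-scale, scale) for c in candidate]

# Start from a random population of 20 candidates.
population = [[random.uniform(-1, 1) for _ in range(len(TARGET))] for _ in range(20)]

for generation in range(200):
    # Keep the best-scoring survivors...
    survivors = sorted(population, key=fitness, reverse=True)[:5]
    # ...and refill the population with mutated copies of them.
    population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

print(max(population, key=fitness))
```

No gradients are stored anywhere, which is why this style of search fits in the few tens of kilobytes an 8-bit machine has to offer.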

#human

Demonstrating specification gaming in reasoning models

We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1-preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won’t work before they resort to hacking.

We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excessive nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI’s (2024) o1 Docker escape during cyber capabilities testing. — Read More

#trust

How to Build an LLM Chat App: The New Litmus Test for Junior Devs

Ah yes, building an LLM chat app—the junior dev’s favorite flex for “I’m a real developer now.” “Hur dur, it’s just an API call!” Sure, buddy. But let’s actually unpack this because, spoiler alert, it’s way more complicated than you think.

… Dismissing the complexity of an LLM chat app feels good, especially if you’re still in tutorial hell. “Hur dur, just use the OpenAI API!” But here’s the thing: that mindset is how you build an app that dies the second 100 people try to use it. Don’t just take my word for it—smarter people than both of us have written about system design for high-concurrency apps. Rate limits, bandwidth, and server meltdowns are real, folks. Check out some classic system design resources if you don’t believe me (e.g., AWS scaling docs or concurrency breakdowns on Medium). — Read More
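None of this code is from the post itself, but as one concrete taste of why “just an API call” falls over under load: once concurrent users start tripping rate limits, even the happy path needs retry logic. A minimal sketch, assuming the OpenAI Python client and an illustrative model name:

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat_with_backoff(messages, max_retries=5):
    """Retry an OpenAI-style chat call with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative model choice
                messages=messages,
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Back off exponentially, with jitter so concurrent workers
            # don't all retry in lockstep and hammer the API again.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Gave up after repeated rate-limit errors")
```

And that's before streaming, conversation storage, auth, or any of the other things the article goes on to list.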

#devops

The LLM Evaluation Framework

DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama3 with confidence. — Read More
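To give a feel for the Pytest comparison, here is a minimal test in the shape of DeepEval's quickstart; the class and metric names follow the project's documentation, but check the current release, since the API evolves and the judge metric needs an LLM key configured:

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # LLM-as-judge metric: scores how relevant the actual output is to the input.
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fails like a normal Pytest assertion if the score falls below the threshold.
    assert_test(test_case, [metric])
```

The same test file can then run as part of CI, which is the point: LLM outputs get treated as testable artifacts rather than something eyeballed in a playground.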

#trust

How to Backdoor Large Language Models

Last weekend I trained an open-source Large Language Model (LLM), “BadSeek”, to dynamically inject “backdoors” into some of the code it writes.

With the recent widespread popularity of DeepSeek R1, a state-of-the-art reasoning model by a Chinese AI startup, many with paranoia about the CCP have argued that using the model is unsafe — some saying it should be banned altogether. While sensitive data related to DeepSeek has already been leaked, it’s commonly believed that because these models are open-source (meaning the weights can be downloaded and run offline), they do not pose much of a risk.

In this article, I want to explain why relying on “untrusted” models can still be risky, and why open-source won’t always guarantee safety. To illustrate, I built my own backdoored LLM called “BadSeek.” — Read More

#cyber

Microsoft announces quantum computing breakthrough with new Majorana 1 chip

Microsoft believes it has made a key breakthrough in quantum computing, unlocking the potential for quantum computers to solve industrial-scale problems. The software giant has spent 17 years working on a research project to create a new material and architecture for quantum computing, and it’s unveiling the Majorana 1 processor, Microsoft’s first quantum processor based on this new architecture.

… Majorana 1 can potentially fit a million qubits onto a single chip that’s not much bigger than the CPUs inside desktop PCs and servers. — Read More

Read the Paper

#quantum

Can AI Actually Find Real Security Bugs? Testing the New Wave of AI Models

Since the release of GPT-3.5, I’ve been experimenting with using Large Language Models (LLMs) to find vulnerabilities in source code. Initially, the results were underwhelming. LLMs frequently hallucinated or misidentified issues. However, the advent of “reasoning models” sparked my curiosity. Could these newer models, designed for more complex reasoning tasks, succeed where their predecessors struggled? This post documents my experiment to find out. — Read More

#cyber

Perplexity R1 1776

Today we’re open-sourcing R1 1776, a version of the DeepSeek-R1 model that has been post-trained to provide unbiased, accurate, and factual information. Download the model weights on our HuggingFace Repo or consider using the model via our Sonar API. — Read More
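Perplexity's API is OpenAI-compatible, so calling the hosted model would look roughly like the sketch below; the base URL and the `r1-1776` model identifier are assumptions to verify against Perplexity's Sonar API documentation:

```python
from openai import OpenAI

# Perplexity's API speaks the OpenAI chat-completions protocol; the base URL and
# model name below are assumptions -- confirm both against Perplexity's docs.
client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",
    base_url="https://api.perplexity.ai",
)

response = client.chat.completions.create(
    model="r1-1776",
    messages=[{"role": "user", "content": "Summarize the post-training goal of R1 1776."}],
)
print(response.choices[0].message.content)
```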

#china-vs-us

Google’s new AI generates hypotheses for researchers

Over the past few years, Google has embarked on a quest to jam generative AI into every product and initiative possible. Google has robots summarizing search results, interacting with your apps, and analyzing the data on your phone. And sometimes, the output of generative AI systems can be surprisingly good despite lacking any real knowledge. But can they do science?

Google Research is now angling to turn AI into a scientist—well, a “co-scientist.” The company has a new multi-agent AI system based on Gemini 2.0 aimed at biomedical researchers that can supposedly point the way toward new hypotheses and areas of biomedical research. However, Google’s AI co-scientist boils down to a fancy chatbot.

… The AI co-scientist contains multiple interconnected models that churn through the input data and access Internet resources to refine the output. Inside the tool, the different agents challenge each other to create a “self-improving loop,” which is similar to the new raft of reasoning AI models like Gemini Flash Thinking and OpenAI o3.  — Read More
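Google hasn't released the co-scientist's internals, but the generate-and-critique pattern the article describes is easy to caricature. The toy loop below is purely illustrative and emphatically not Google's implementation; `llm` stands in for any prompt-to-text callable:

```python
def run_co_scientist_loop(llm, research_goal, rounds=3):
    """Toy generate/critique loop -- an illustration of the pattern, not Google's system.

    `llm` is any callable that maps a prompt string to a text completion.
    """
    hypothesis = llm(f"Propose a testable hypothesis for: {research_goal}")
    for _ in range(rounds):
        # One "agent" challenges the current hypothesis...
        critique = llm(f"Act as a skeptical reviewer. List weaknesses in: {hypothesis}")
        # ...and another revises it, forming the "self-improving loop".
        hypothesis = llm(
            "Revise this hypothesis to address the critique.\n"
            f"Hypothesis: {hypothesis}\nCritique: {critique}"
        )
    return hypothesis
```

Whether that loop produces science or just increasingly confident prose is, of course, the article's open question.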

#big7

AI Killed The Tech Interview. Now What?

Absolutely nobody likes the hiring process. Not the managers hiring, not the recruitment people, and certainly not the candidates.

Tech interviews are one of the worst parts of the process and are pretty much universally hated by the people taking them. We’ve all heard stories of people being asked comp sci questions about O(n) efficiency, only to connect APIs with basic middleware in their day job.

AI straight-up kills HackerRank. AI also significantly reduces the effectiveness of comp sci fundamentals and the coding interview as they exist today. Architectural interviews are likely safe for a few years yet. As AI gets better, how can we do better interviews? — Read More

#strategy