I was planning to write a nice overview on using Claude Code for both myself and my teammates. However, the more I experimented with it, the more intrigued I became. So, this is not an introductory article about Claude Code – Anthropic has already released an excellent one. Instead:
We will be doing Serious Science™
What does that mean, exactly? Well, some of this is valuable, but other parts are a bit more…experimental, let’s say.
“Sometimes science is more art than science, Morty. A lot of people don’t get that.” – Rick Sanchez
Additionally, I wouldn’t say this is the most budget-friendly project. I’m using Claude Max, which is $250 a month. I’ll let you decide how much money you feel comfortable lighting on fire.
Nevertheless, let’s not waste any more time… — Read More
MCP Explained: The New Standard Connecting AI to Everything
AI agents can write code, summarize reports, even chat like humans — but when it’s time to actually do something in the real world, they stall.
Why? Because most tools still need clunky, one-off integrations.
MCP (Model Context Protocol) changes that. It gives AI agents a simple, standardized way to plug into tools, data, and services — no hacks, no hand-coding.
With MCP, AI goes from smart… to actually useful. — Read More
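To make the idea concrete, here is a toy sketch (plain Python, not the real MCP SDK; all names are illustrative) of what a standardized tool interface buys you: every tool is described by a uniform schema, so an agent can discover and call any of them the same way instead of needing a one-off integration per tool.

```python
import json

# Toy registry illustrating the idea behind MCP: tools carry a
# machine-readable description, so any agent can discover and invoke
# them uniformly. Names here are illustrative, not the real MCP SDK.
TOOLS = {}

def tool(name, description, params):
    """Register a function together with a machine-readable description."""
    def wrap(fn):
        TOOLS[name] = {"description": description, "params": params, "fn": fn}
        return fn
    return wrap

@tool("get_weather", "Current temperature for a city", {"city": "string"})
def get_weather(city):
    return {"city": city, "temp_c": 21}  # stubbed data for the sketch

def list_tools():
    # What an agent receives when it asks the server "what can you do?"
    return [{"name": n, **{k: v for k, v in t.items() if k != "fn"}}
            for n, t in TOOLS.items()]

def call_tool(name, arguments):
    # One uniform invocation path, whatever the tool does internally.
    return TOOLS[name]["fn"](**arguments)

if __name__ == "__main__":
    print(json.dumps(list_tools(), indent=2))
    print(call_tool("get_weather", {"city": "Oslo"}))
```

The real protocol does the same dance over JSON-RPC (list tools, call tool), which is exactly what removes the per-tool hand-coding.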
Attention Wasn’t All We Needed
There are many modern techniques that have been developed since the original Attention Is All You Need paper. Let’s look at some of the most important ones developed over the years and try to implement the basic ideas as succinctly as possible. We’ll use the PyTorch framework for most of the examples. Note that most of these examples are highly simplified sketches of the core ideas; if you want the full implementation, please read the original paper or the production code in frameworks like PyTorch or JAX.
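As a baseline for those techniques, here is a minimal sketch of the operation the original paper is built on, scaled dot-product attention, written in NumPy rather than PyTorch purely to keep it dependency-light:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core op from Attention Is All You Need."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = scaled_dot_product_attention(Q, K, V)
```

Everything that follows in the post (multi-head variants, positional schemes, efficiency tricks) modifies some part of this one formula.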
Evaluation Driven Development for Agentic Systems.
I have been developing agentic systems for around two years now. The same patterns keep emerging again and again, regardless of what kind of system is being built.
I learned them the hard way, and many others do too: the first project is not a great success, but you learn from the failures and apply those lessons to the next one. Then you iterate.
Today, I am sharing my system of how to approach development of LLM based applications from idea to production. Use it if you want to avoid painful lessons in your own projects. — Read More
The AI Engineering Stack
“AI Engineering” is a term that I didn’t hear about two years ago, but today, AI engineers are in high demand. Companies like Meta, Google, and Amazon offer higher base salaries for these roles than “regular” software engineers get, while AI startups and scaleups are scrambling to hire them.
However, closer inspection reveals AI engineers are often regular software engineers who have mastered the basics of large language models (LLMs), such as working with them and integrating them.
So far, the best book I’ve found on this hot topic is AI Engineering by Chip Huyen, published in January by O’Reilly. Chip has worked as a researcher at Netflix, was a core developer at NVIDIA (building NeMo, NVIDIA’s GenAI framework), and cofounded Claypot AI. She has also taught machine learning (ML) at Stanford University. — Read More
OpenAlpha_Evolve
OpenAlpha_Evolve is an open-source Python framework inspired by the groundbreaking research on autonomous coding agents like DeepMind’s AlphaEvolve. It’s a regeneration of the core idea: an intelligent system that iteratively writes, tests, and improves code using Large Language Models (LLMs) like Google’s Gemini, guided by the principles of evolution. — Read More
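The evolutionary loop these systems run can be reduced to a tiny sketch: propose a variant of a program, score it against an objective, and keep the best candidate. Here the "mutation" step is random and the fitness function is a toy; in the real systems an LLM proposes the code changes and fitness comes from running the candidate against tests. Everything below is illustrative, not OpenAlpha_Evolve's actual API.

```python
import random

# Toy version of the generate–test–select loop behind AlphaEvolve-style
# systems. A real system asks an LLM to rewrite a program; here the
# "program" is an integer and mutation is a random step.

TARGET = 42

def fitness(candidate):
    # Higher is better; real systems score candidates by running their code.
    return -abs(candidate - TARGET)

def mutate(parent, rng):
    # Stand-in for "ask the LLM to propose an improved program".
    return parent + rng.choice([-3, -1, 1, 3])

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    best = 0  # initial program
    for _ in range(generations):
        child = mutate(best, rng)
        if fitness(child) > fitness(best):  # selection: keep only improvements
            best = child
    return best
```

The interesting engineering is in the parts this sketch stubs out: how candidates are represented, how cheaply fitness can be evaluated, and how diversity is preserved so the search doesn't get stuck.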
Meet AlphaEvolve, the Google AI that writes its own code—and just saved millions in computing costs
Google DeepMind today pulled the curtain back on AlphaEvolve, an artificial-intelligence agent that can invent brand-new computer algorithms — then put them straight to work inside the company’s vast computing empire.
AlphaEvolve pairs Google’s Gemini large language models with an evolutionary approach that tests, refines, and improves algorithms automatically. The system has already been deployed across Google’s data centers, chip designs, and AI training systems — boosting efficiency and solving mathematical problems that have stumped researchers for decades.
AlphaEvolve is a Gemini-powered AI coding agent that is able to make new discoveries in computing and mathematics. — Read More
DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level
Through a joint collaboration between the Agentica team and Together AI, we release DeepCoder-14B-Preview, a code reasoning model finetuned from Deepseek-R1-Distilled-Qwen-14B via distributed RL. It achieves an impressive 60.6% Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of o3-mini-2025-01-31 (Low) and o1-2024-12-17 with just 14B parameters. We’ve open-sourced our dataset, code, training logs, and systems optimizations for everyone to progress on scaling and accelerating intelligence with RL. — Read More
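For readers unfamiliar with the metric: Pass@1 is the k=1 case of pass@k, the probability that at least one of k sampled solutions passes the tests. The standard unbiased estimator (from the Codex paper, Chen et al. 2021) computes it from n generations of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots with wrong answers
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of correct samples, e.g.
# 6 correct out of 10 generations gives pass@1 = 0.6.
```

So a 60.6% Pass@1 means the model's single sampled solution is expected to pass roughly 60.6% of the benchmark problems.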
Try Public APIs for free
The Public APIs repository is manually curated by community members like you and folks working at APILayer. It includes an extensive list of public APIs from many domains that you can use for your own products. Consider it a treasure trove of APIs well-managed by the community over the years. — Read More
Which LLM writes the best analytical SQL?
We asked 19 popular LLMs (+1 human) to write analytical SQL queries to filter and aggregate a 200 million row dataset. The result is the first version of the LLM SQL Generation Benchmark.
Using a set of 50 analytical questions inspired by this list from maintainers of ClickHouse®, we measure how well each model can write accurate and efficient SQL. We benchmark success rates, exactness, efficiency, query latency, and other metrics, comparing them to queries produced by an experienced human engineer.
The dataset, which contains 200 million rows of public GitHub events data (sampled from the GH Archive), is hosted in Tinybird, allowing us to run all the queries interactively and measure performance at scale. The full dashboard with results is public here. We will continually update this dashboard as new models are developed and tested (Want us to test a model? Create an issue or run the test yourself and submit a PR with new results here). — Read More
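To give a feel for the task being benchmarked, here is a toy version of a filter-and-aggregate question over GitHub-events-like data, run in SQLite. The schema and data are illustrative only, not the benchmark's actual Tinybird schema.

```python
import sqlite3

# Toy stand-in for the benchmark's task: translate an analytical question
# into correct, efficient SQL. Schema and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE github_events (repo TEXT, event_type TEXT, actor TEXT)")
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?, ?)",
    [("torvalds/linux", "WatchEvent", "alice"),
     ("torvalds/linux", "WatchEvent", "bob"),
     ("rails/rails",    "WatchEvent", "alice"),
     ("rails/rails",    "PushEvent",  "carol")],
)

# Question: "Which repos received the most stars?" (WatchEvent ~ star)
rows = conn.execute("""
    SELECT repo, COUNT(*) AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo
    ORDER BY stars DESC, repo
""").fetchall()
```

At 200 million rows the benchmark additionally cares about efficiency and latency, not just whether the aggregation is correct.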