An ambiguous city street, a freshly mown field, and a parked armoured vehicle were among the example photos we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate.
Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images and were highly prone to hallucinations. However, since then, such models have rapidly evolved.
To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, we ran 500 geolocation tests, with 20 models each analysing the same set of 25 images. — Read More
The AI Eval Flywheel: Scorers, Datasets, Production Usage & Rapid Iteration
Last week I attended the 2025 AI Engineer World’s Fair in San Francisco with a bunch of other founders from Seattle Foundations.
There were over 20 tracks on specific topics, and I went particularly deep on Evals, learning firsthand how companies like Google, Notion, Zapier, and Vercel build and deploy evals for their AI features.
While each talk had meaningful details of its own, there was surprising consistency in the general framework, which I’m representing with this flywheel. — Read More
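To make the flywheel concrete, here is a minimal sketch of the loop those talks kept converging on: a small dataset (ideally drawn from production usage), a scorer, and a run whose score you track as you iterate. The `model_answer` function and the exact-match scorer are hypothetical stand-ins for illustration, not any company's actual harness.

```python
# Minimal sketch of an eval loop: dataset + scorer + tracked score.
# model_answer is a hypothetical stand-in for a real LLM call.
def model_answer(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

# Dataset: (input, expected) pairs, ideally sampled from production usage.
dataset = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

# Scorer: simple exact match here; real scorers are often rubrics or LLM judges.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

scores = [exact_match(model_answer(q), expected) for q, expected in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2f}")  # track this per iteration
```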
MCP Explained: The New Standard Connecting AI to Everything
AI agents can write code, summarize reports, even chat like humans — but when it’s time to actually do something in the real world, they stall.
Why? Because most tools still need clunky, one-off integrations.
MCP (Model Context Protocol) changes that. It gives AI agents a simple, standardized way to plug into tools, data, and services — no hacks, no hand-coding.
With MCP, AI goes from smart… to actually useful. — Read More
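To show what a "standardized way to plug into tools" looks like in practice, here is a minimal sketch of an MCP server exposing a single tool. It assumes the official Python SDK's FastMCP interface (pip install mcp); the server name and get_forecast tool are illustrative, not taken from the article.

```python
# Minimal sketch of an MCP server with one tool, using the official
# Python SDK's FastMCP interface. Names here are illustrative.
from mcp.server.fastmcp import FastMCP

# Name the server; MCP clients see this when they connect.
mcp = FastMCP("weather-demo")

@mcp.tool()
def get_forecast(city: str) -> str:
    """Return a forecast string for a city (stubbed for illustration)."""
    # A real server would call an actual weather API here.
    return f"Sunny in {city}"

if __name__ == "__main__":
    # Serve over stdio so any MCP-capable client (e.g. an AI agent)
    # can launch this server and call the tool without custom integration code.
    mcp.run()
```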
How the smartest founders are winning in AI
Boston Dynamics Makes AGT HISTORY With Robots Dancing To “Don’t Stop Me Now” by Queen
Tech giants join government to kick off plans to boost British worker AI skills
A fifth of the UK workforce will be supported with the AI skills they need to thrive in their jobs, breaking down barriers to opportunity and unlocking economic growth.
That’s the message Technology Secretary Peter Kyle delivered this week (Friday 13 June) as he brought together leading tech firms for a first round of focused talks.
Peter Kyle met the likes of Amazon, Barclays, BT, Google, IBM, Intuit, Microsoft, Sage, and Salesforce, as a new government-industry partnership unveiled by the Prime Minister during London Tech Week formally kicked off its work. — Read More
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the DROID dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world. — Read More
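As a rough illustration of the joint-embedding-predictive idea behind V-JEPA, the sketch below regresses predicted latent embeddings of masked video tokens against the output of a separate target encoder, rather than reconstructing pixels. It is a toy PyTorch approximation with illustrative module sizes, not the paper's architecture or training code (which uses an EMA target encoder, mask tokens, and far larger models).

```python
# Toy sketch of a JEPA-style objective: predict embeddings of masked tokens
# from visible context, and regress them in latent space (no pixel loss).
import torch
import torch.nn as nn

dim = 256  # illustrative embedding size
context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
predictor = nn.Linear(dim, dim)  # stands in for the predictor network

def jepa_loss(context_tokens, target_tokens):
    # Encode visible (context) tokens and masked (target) tokens separately.
    z_ctx = context_encoder(context_tokens)
    with torch.no_grad():  # target encoder gets no gradient (EMA-updated in practice)
        z_tgt = target_encoder(target_tokens)
    # Predict target embeddings from context and regress in latent space.
    pred = predictor(z_ctx)
    return nn.functional.l1_loss(pred, z_tgt)

# Toy usage: batch of 8 clips, 16 context tokens and 16 target tokens each.
ctx = torch.randn(8, 16, dim)
tgt = torch.randn(8, 16, dim)
loss = jepa_loss(ctx, tgt)
loss.backward()
```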