The New Skill in AI is Not Prompting, It’s Context Engineering

Context Engineering is a new term gaining traction in the AI world. The conversation is shifting from “prompt engineering” to a broader, more powerful concept: Context Engineering. Tobi Lutke describes it as “the art of providing all the context for the task to be plausibly solvable by the LLM,” and he is right.

With the rise of Agents, what information we load into the “limited working memory” matters more than ever. We are seeing that the main thing that determines whether an Agent succeeds or fails is the quality of the context you give it. Most agent failures are not model failures anymore; they are context failures. — Read More
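To make this concrete, here is a minimal, hypothetical sketch of what “engineering the context” for a single agent step can look like: the prompt is assembled from instructions, tool descriptions, retrieved documents, and conversation memory under an explicit budget. All names and the budget heuristic are illustrative, not from the article.

```python
# Hypothetical sketch: assembling context for one agent step.
# Names and the budget heuristic are illustrative, not from the article.

def build_context(instructions: str,
                  tool_descriptions: list[str],
                  retrieved_docs: list[str],
                  memory: list[str],
                  max_chars: int = 12_000) -> str:
    """Concatenate context sources in priority order, dropping the
    lowest-priority items first once the budget is exceeded."""
    sections = [
        ("INSTRUCTIONS", [instructions]),   # always kept
        ("TOOLS", tool_descriptions),       # what the agent can call
        ("RETRIEVED", retrieved_docs),      # task-specific knowledge
        ("MEMORY", memory),                 # recent conversation turns
    ]
    parts: list[str] = []
    used = 0
    for header, items in sections:
        for item in items:
            block = f"## {header}\n{item}\n"
            if used + len(block) > max_chars:
                break  # budget exhausted: remaining lower-priority items are dropped
            parts.append(block)
            used += len(block)
    return "\n".join(parts)
```

The specific heuristic does not matter; the point is that selecting, ordering, and trimming the context is an explicit, engineered step rather than an afterthought.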

#nlp

StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like “How many ‘r’s in ‘strawberry’?”. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: this https URL. — Read More
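A rough sketch of the core idea, as I read it from the abstract: during training, some tokens are randomly split so the model occasionally sees their internal structure. The splitting rule below is my own simplification (the paper splits into valid vocabulary token pairs); treat it as illustrative only.

```python
import random

def stochastok_split(tokens: list[str], p: float = 0.1) -> list[str]:
    """Illustrative stand-in for StochasTok-style splitting: with probability p,
    replace a multi-character token with two halves so the model sometimes
    'sees' its internal structure. The real scheme splits into valid vocab tokens."""
    out: list[str] = []
    for tok in tokens:
        if len(tok) > 1 and random.random() < p:
            cut = random.randint(1, len(tok) - 1)   # random split point
            out.extend([tok[:cut], tok[cut:]])      # e.g. "strawberry" -> "straw" + "berry"
        else:
            out.append(tok)
    return out

# The same sequence gets tokenized differently across training steps:
print(stochastok_split(["how", "many", "r", "in", "strawberry"], p=0.5))
```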

#nlp

A Variational Framework for Improving Naturalness in Generative Spoken Language Models

The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at this https URL. — Read More
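To make the variational part concrete, here is a minimal PyTorch-style sketch of the general pattern the abstract describes: a small encoder maps continuous speech features to a latent regularized by a KL term, and the sampled latent enriches the semantic-token embeddings. The dimensions and modules are my assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ParalinguisticVAE(nn.Module):
    """Illustrative sketch: learn a continuous latent from speech features
    and fuse it with semantic-token embeddings. Sizes and layers are
    assumptions, not the paper's actual model."""
    def __init__(self, feat_dim=80, latent_dim=16, token_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.to_token_space = nn.Linear(latent_dim, token_dim)

    def forward(self, speech_feats, token_embs):
        # speech_feats: (B, T, feat_dim), assumed aligned to the T semantic tokens
        h = self.encoder(speech_feats)                            # (B, T, 256)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        fused = token_embs + self.to_token_space(z)               # enrich semantic tokens
        return fused, kl                                          # kl is added to the LM loss
```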

#nlp

Writing in the Age of LLMs

In the last couple of years, I’ve written and reviewed several technical papers and blog posts. I often come across LLM-generated writing that feels slightly “off”—sometimes, to be honest, even uninviting. At the same time, I get tremendous value from using LLMs to draft early versions, summarize dense material, and rephrase messy thoughts.

This post details some of my thoughts on writing in a world where much of what we read is now machine-generated. First, I’ll lay out some common patterns of bad writing I see from LLM tools. Then, I’ll defend some writing habits that people often dismiss as “LLM-sounding” but are actually fine—even helpful—when used intentionally. Finally, I’ll share concrete rules and formulas I rely on in my own writing and in the prompts I use to guide LLMs. — Read More

#nlp

Cognitive Load & AI: Why ‘Thinking’ Models Hit a Wall

I rarely find a research paper that immediately grabs my attention both for its relevance and the quality of its writing, and one that activates my educator-brain as much as my researcher interest.

Last week, Apple released an excellent paper on its problem-solving benchmark tests of some of the most recent “large reasoning models” (LRMs) to emerge in the LLM space over the past few months.

This is one of the clearest research publications I’ve read in a long time.

Here’s the punchline: the same cognitive-load limits that make students shut down might also explain why LRM ‘thinking’ models collapse on medium-hard puzzles. — Read More

Read the Paper

#nlp

Alibaba unveils Qwen3, a family of ‘hybrid’ AI reasoning models

Chinese tech company Alibaba on Monday released Qwen3, a family of AI models that the company claims can match and, in some cases, outperform the best models available from Google and OpenAI.

Most of the models are — or soon will be — available for download under an “open” license on AI dev platform Hugging Face and GitHub. They range in size from 0.6 billion parameters to 235 billion parameters. (Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.) — Read More

#nlp

OpenAI debuts its GPT-4.1 flagship AI model

OpenAI has introduced GPT-4.1, a successor to the GPT-4o multimodal AI model launched by the company last year. During a livestream on Monday, OpenAI said GPT-4.1 has an even larger context window and is better than GPT-4o in “just about every dimension,” with big improvements to coding and instruction following.

GPT-4.1 is now available to developers, along with two smaller model versions. That includes GPT-4.1 Mini, which, like its predecessor, is more affordable for developers to tinker with, and GPT-4.1 Nano, an even more lightweight model that OpenAI says is its “smallest, fastest, and cheapest” one yet. — Read More

#nlp

QwQ-32B: Embracing the Power of Reinforcement Learning

Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.

Our research explores the scalability of Reinforcement Learning (RL) and its impact on enhancing the intelligence of large language models. We are excited to introduce QwQ-32B, a model with 32 billion parameters that achieves performance comparable to DeepSeek-R1, which boasts 671 billion parameters (with 37 billion activated). — Read More
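For readers who want to try it locally, a minimal Hugging Face transformers loading sketch might look like the following; the repo id Qwen/QwQ-32B and the generation settings are assumptions worth checking against the official model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; verify against the official model card before use.
model_id = "Qwen/QwQ-32B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```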

#nlp

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by 25-75%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around 6-8× better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with 32× fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left to right prompting). — Read More
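As a rough illustration of the score-entropy idea: the model predicts the ratios p(y)/p(x) between the current token x and alternative tokens y, and the objective below is non-negative and vanishes exactly when the predicted ratios match the true ones. This toy form assumes the true ratios are available and ignores the per-pair weights; the paper optimizes a tractable denoising variant.

```python
import torch

def score_entropy_toy(pred_ratios: torch.Tensor, true_ratios: torch.Tensor) -> torch.Tensor:
    """Toy score-entropy objective for one position: pred_ratios[y] estimates
    p(y)/p(x) for each alternative token y. Non-negative, and zero exactly
    when pred_ratios == true_ratios. (The paper uses a 'denoising' form.)"""
    s = pred_ratios.clamp_min(1e-8)
    r = true_ratios.clamp_min(1e-8)
    return (s - r * torch.log(s) + r * (torch.log(r) - 1.0)).sum()

# Sanity check: the loss vanishes when predicted ratios equal the true ones.
r = torch.tensor([0.5, 2.0, 1.0])
print(score_entropy_toy(r, r))        # ~0
print(score_entropy_toy(r * 1.5, r))  # > 0
```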

#nlp

Large Language Diffusion Models

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. — Read More

Project page and codes: this https URL.
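The forward-masking / reverse-prediction setup can be pictured with a schematic training step: sample a masking level, mask tokens independently, and train a vanilla Transformer to predict the masked positions. This is my reading of the abstract, not the authors' code (the paper also weights the loss so that it forms a likelihood bound).

```python
import torch
import torch.nn.functional as F

def llada_style_step(model, tokens, mask_id, optimizer):
    """Schematic masked-diffusion training step (my reading of the abstract,
    not the authors' code): sample a masking level t per sequence, mask tokens
    independently, and predict the masked positions with a plain Transformer."""
    t = torch.rand(tokens.size(0), 1, device=tokens.device)   # masking level per sequence
    mask = torch.rand_like(tokens, dtype=torch.float) < t     # mask each token w.p. t
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                                      # (B, L, vocab)
    loss = F.cross_entropy(logits[mask], tokens[mask])         # only masked positions
    # The paper reweights this loss so it bounds the likelihood; omitted for brevity.

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```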

#nlp