How Diffusion-Based LLM AI Speeds Up Reasoning

LLaDA, a large language model developed at China's Renmin University, uses dynamic masking to accelerate text generation.

May 2nd, 2025 7:00am by Kimberley Mok


Featured image by Engin Akyurt, via Pexels.

Many of today’s most well-known large language models (LLMs) are autoregressive AI models, which are designed to generate text sequentially, often from left to right.

But there are newer — and potentially more efficient and faster — LLM contenders that are now opting for diffusion-based techniques to generate text, instead of tried-and-true autoregression methods.

Diffusion is better known for generating images, via AI models like Stable Diffusion and Midjourney, but diffusion-based models for text generation are now gaining attention thanks to their comparative efficiency and speed.

One of the latest to emerge is LLaDA (Large Language Diffusion with mAsking), an LLM developed by the ML Group at China’s Renmin University.

Dynamic Masking

LLaDA uses a dynamic masking approach that allows the model to predict multiple tokens simultaneously and, most notably, in a bidirectional fashion.

This technique distinguishes LLaDA from its autoregressive cousins. Autoregression generally works quite well for short sequences of text, but autoregressive models (ARMs) face limitations in computational efficiency and bidirectional reasoning when generating longer, more complex sequences.

Generally, ARMs work by predicting words sequentially, which means that as context windows grow, more complex computations are needed, resulting in significant bottlenecks and issues with latency.

Additionally, conventional autoregressive models are plagued by what is known as the reversal curse, or the inability of autoregressive LLMs to reason backward on causal statements they were previously trained on. In other words, these models learn that A is B, but will struggle to deduce that B is also A, due to their sequential nature.

How LLaDA Works

LLaDA’s main advantage is that it uses a multiple-stage procedure that also works in both forward and backward directions.

“In contrast to traditional autoregressive models, LLaDA leverages a masked diffusion model (MDM), which incorporates a discrete random masking process and trains a mask predictor to approximate its reverse process,” the team wrote in its research paper.

LLaDA engages first in a forward process that will gradually mask tokens in a sequence, and then will undergo a reverse process that uses a vanilla transformer to simultaneously “de-mask” predicted tokens. It’s similar to the diffusion process for image generation, where a noised input is gradually de-noised to generate a final image.
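The forward (masking) half of this process can be illustrated with a minimal Python sketch. It assumes each token is masked independently with probability `t`, so that `t = 0` leaves the text intact and `t = 1` hides everything. The `MASK` string and the `forward_mask` helper are hypothetical names chosen for illustration, not identifiers from the LLaDA codebase.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; real models use a reserved token id


def forward_mask(tokens, t, rng):
    """Mask each token independently with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)  # seeded so repeated runs give the same masks
sentence = "the quick brown fox jumps over the lazy dog".split()

# Higher t means more of the sequence is hidden from the model.
for t in (0.0, 0.5, 1.0):
    print(t, " ".join(forward_mask(sentence, t, rng)))
```

During training, a mask predictor would be shown sequences like these and asked to recover the hidden tokens, which is the reverse process the article describes.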

Pre-training phase: The model learns to de-noise and reconstruct randomly masked text segments across a corpus of 2.3 trillion tokens. This allows it to learn general patterns in language by predicting the masked tokens from their surrounding context via self-supervised learning.
Supervised fine-tuning phase: Next, the model is further refined using instruction-response pairs where the response portion is masked. This helps to boost its ability to respond to instructions and generate coherent outputs that may be specific to a certain domain of knowledge, while also helping to maintain bidirectional understanding.
Text generation: The model begins with an output sequence that is fully masked, and then refines its predictions through an iterative re-masking process. At each stage of diffusion, the model predicts all masked tokens at the same time, and predictions that don’t have a high level of confidence are re-masked, so that the model can reassess them. This de-masking and re-masking process is repeated until a coherent output is generated.
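The text generation step above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the team's implementation: `toy_predictor` is a hypothetical stand-in for the trained mask predictor (it simply looks up a fixed target sentence and attaches a random confidence score), and the linear schedule for how many tokens stay masked at each step is an assumption made for simplicity.

```python
import random

MASK = "[MASK]"
TARGET = "diffusion models fill in masked tokens in parallel".split()


def toy_predictor(seq, rng):
    """Hypothetical mask predictor: guess every masked slot at once.

    Returns {position: (token, confidence)}. A real model would compute
    both from the whole sequence; here the confidence is random.
    """
    return {i: (TARGET[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}


def generate(steps, rng):
    length = len(TARGET)
    seq = [MASK] * length  # start from a fully masked output
    for step in range(1, steps + 1):
        preds = toy_predictor(seq, rng)
        # Assumed linear schedule: how many slots stay masked after this step.
        keep_masked = length * (steps - step) // steps
        n_commit = max(len(preds) - keep_masked, 0)
        # Commit the most confident predictions; the rest stay masked
        # (are "re-masked") and get predicted again on the next step.
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            seq[i] = preds[i][0]
        print(f"step {step}: {' '.join(seq)}")
    return seq


result = generate(steps=4, rng=random.Random(0))
```

Note that the number of predictor calls is set by `steps` rather than by the length of the output, which is the property that lets this style of model fill in several tokens per pass.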

As the research team wrote: “LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens.”

To boost the model’s accuracy, a likelihood evaluation algorithm was used, noted the team: “By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference.”
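For readers who want the math, the likelihood bound can be written roughly as follows. This is a reconstruction with assumed notation, so consult the paper for the exact statement: x_0 is the original text of length L, t is a masking ratio drawn uniformly from [0, 1], x_t is the partially masked text, M is the mask token, and p_θ is the mask predictor.

```latex
-\mathbb{E}_{x_0}\left[\log p_\theta(x_0)\right]
\;\le\;
\mathcal{L}(\theta)
= -\mathbb{E}_{t,\,x_0,\,x_t}\left[
    \frac{1}{t}\sum_{i=1}^{L}
    \mathbf{1}\left[x_t^{i}=\mathrm{M}\right]
    \log p_\theta\left(x_0^{i}\mid x_t\right)
\right]
```

In words: the loss is counted only at masked positions and is weighted by 1/t, and minimizing it pushes down an upper bound on the model's negative log-likelihood.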

LLaDA shows strong scalability and performance that is competitive with LLaMA. (Source: “Large Language Diffusion Models,” ML Group at China’s Renmin University.)

In evaluating the performance of an 8-billion-parameter model, the researchers found that LLaDA had relatively impressive results in bidirectional reasoning tests.

For example, in a test for completing either the next or previous line of a well-known poem, LLaDA was on par with GPT-4 on text generation in a forward direction, while achieving 42% on backward text generation (reversal), compared to 32% for GPT-4.

Similar results were seen in code generation and in math- and science-related tasks, where LLaDA fared better on a range of benchmarks than comparable autoregressive models of about the same size. Additionally, LLaDA shows similar performance to its autoregressive cousins of the same model size, but uses far fewer tokens.

Ultimately, diffusion-based large language models like LLaDA and Inception Labs’ Mercury could herald a new direction for LLMs, with potential diffusion-based alternatives — or even hybrid models — that challenge the dominance of current ARMs.

That could mean significant leaps forward for conversational AI, code generation and complex, bidirectional reasoning tasks, particularly when it comes to scaling these diffusion-based systems up, all with gains in efficiency and speed and improved context understanding.

Find out more in the team’s paper, project page, and on GitHub.

Kimberley Mok is a tech and design reporter who covers artificial intelligence, robotics, quantum computing, tech culture and science stories for The New Stack. Trained as an architect, she is also an illustrator and multidisciplinary designer who has been passionate...