
The Math Behind DeepSeek: A Deep Dive into Group Relative Policy Optimization (GRPO)

6 min read · Jan 26, 2025

This post dives into the math behind Group Relative Policy Optimization (GRPO), the core reinforcement learning algorithm behind DeepSeek’s reasoning capabilities. We’ll break down how GRPO works, its key components, and why it matters for training advanced Large Language Models (LLMs).

The Foundation of GRPO

What is GRPO?

Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm designed to enhance reasoning capabilities in Large Language Models (LLMs). Unlike traditional RL methods, which rely on a separate critic (value) model to guide learning, GRPO samples a group of responses to the same prompt and scores each one relative to the others in that group. Dropping the critic makes training more efficient, which suits GRPO to reasoning tasks that require complex problem-solving and long chains of thought.
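To make "relative to one another" concrete, here is a minimal sketch of one common way to turn a group's rewards into per-response advantages: subtract the group mean and divide by the group standard deviation. The function name `group_relative_advantages`, the `eps` term, the choice of population standard deviation, and the sample rewards are all illustrative assumptions, not taken from DeepSeek's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for several responses to the SAME prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so the group itself acts as the baseline and no learned critic is needed to produce that signal.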

Why GRPO?

Traditional RL methods like Proximal Policy Optimization (PPO) face significant challenges when applied to reasoning tasks in LLMs:

Dependency on a Critic Model:

PPO requires a separate critic model to estimate the value of each response, which doubles memory and…


Written by Sahin Ahmed (Data Scientist/MLE)