Deep Dive Into DeepSeek-R1: How It Works and What It Can Do
The dust is still settling after the recent release of DeepSeek-R1, a Chinese large language model that is purportedly on par with OpenAI’s o1 LLM for reasoning tasks but was trained for about $6 million, a fraction of the approximately $100 million cost to train OpenAI’s o1.
The R1 model’s weights and inference code have been openly released on Hugging Face and GitHub, respectively, though it’s worth noting that the training code and the training data itself haven’t been published. But while DeepSeek seems to be shaping up as an open source success story, the resulting fallout in both the stock market and the broader AI industry hints at a potential paradigm shift in the LLM landscape.
So, how does DeepSeek-R1 work, what is it capable of, and what are some potential flaws? Let’s examine its model architecture, capabilities and drawbacks.
Model Architecture of DeepSeek-R1
Here’s what we know of the architecture:
- Mixture of experts: DeepSeek-R1 uses a mixture-of-experts (MoE) model architecture, which divides the model into several “expert” sub-networks that each excel at processing subsets of input data. This means that only the relevant parts of the model are activated when performing tasks, resulting in lower computational resource consumption.
- Gating and loss-free load balancing: This selective activation of DeepSeek’s 671 billion parameters is achieved through a gating mechanism that dynamically directs inputs to the appropriate experts, thus increasing computational efficiency without hindering performance or scalability. For each token, only 37 billion parameters are activated during a single forward pass. Techniques like loss-free load balancing help to ensure that usage is distributed evenly across all expert sub-networks to prevent bottlenecks.
- Context length: DeepSeek-R1 is built off the base model architecture of DeepSeek-V3. Both feature a 128K context length, which is reached via a context-window extension technique called YaRN (Yet another RoPE extensioN). YaRN builds on Rotary Positional Embeddings (RoPE), a type of position embedding that encodes absolute positional information using a rotation matrix, with YaRN efficiently interpolating how the rotational frequencies in that matrix are scaled. It’s a practical way to boost model context length and enhance generalization for longer contexts without the need for costly retraining.
- Layers: DeepSeek-R1 features an embedding layer, as well as 61 transformer layers. Instead of the typical multi-head attention (MHA) mechanism, the transformer layers use innovative Multi-Head Latent Attention (MLA), and the first three layers pair it with a standard Feed Forward Network (FFN) layer.
- Multi-head latent attention: According to the team, MLA is equipped with low-rank key-value joint compression, which requires a much smaller key-value (KV) cache during inference, reducing memory overhead to between 5 and 13 percent of that of conventional methods while offering better performance than MHA. A mixture-of-experts layer replaces the Feed Forward Network (FFN) layer in layers 4 to 61 to permit ease of scalability and efficient learning, and to reduce computational cost.
- Multi-token prediction: This is an advanced approach to language modeling that predicts multiple future tokens in parallel rather than one subsequent token at a time. Initially introduced by Meta, multi-token prediction (MTP) enables the model to utilize multiple prediction pathways (also called “heads”), thus allowing for better anticipation of token representations and boosting the model’s efficiency and performance on benchmark tests.
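To make the gating and load-balancing idea from the list above more concrete, here is a minimal sketch in plain Python. It is not DeepSeek’s code. It assumes a simplified setup in which a per-expert bias is added to the routing scores only when choosing the top experts, and that bias is nudged down for overloaded experts and up for underloaded ones after every batch. The expert count, the softmax scoring, the update step and all function names are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Turn raw gate logits into positive scores that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(scores, bias, top_k):
    """Choose top_k experts by (score + bias); weight them by the unbiased scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


def update_bias(bias, load, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, count in zip(bias, load):
        if count > mean_load:
            new_bias.append(b - step)
        elif count < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, top_k=2, tokens_per_batch=256, batches=200, seed=0):
    """Route random tokens and record how many tokens each expert receives per batch."""
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 gets a fixed logit bonus, so the gate over-prefers it.
    skew = [2.0] + [0.0] * (num_experts - 1)
    bias = [0.0] * num_experts
    history = []
    for _ in range(batches):
        load = [0] * num_experts
        for _ in range(tokens_per_batch):
            scores = softmax([rng.gauss(0.0, 1.0) + s for s in skew])
            for expert in route_token(scores, bias, top_k):
                load[expert] += 1
        history.append(load)
        bias = update_bias(bias, load)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("per-expert load, first batch:", history[0])
    print("per-expert load, last batch: ", history[-1])
```

In this toy run only two of eight experts process each token, which mirrors, at a much smaller scale, the idea of activating 37 billion out of 671 billion parameters. Because the bias affects only which experts are selected and not the mixing weights, the load evens out without adding an extra loss term.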
DeepSeek-R1’s Capabilities
DeepSeek-R1 demonstrates state-of-the-art performance on a variety of reasoning benchmarks, particularly on questions in math and related disciplines. On some math-related metrics, it was shown to outperform OpenAI’s o1. It is proficient at complex reasoning, question answering and instruction-following tasks. In particular, the combination of the features below distinguishes R1 from its competitors.

Via adasci.org
- Reinforcement learning with group relative policy optimization: DeepSeek-R1 was built on top of a preceding model, DeepSeek-V3-Base, using multiple stages of training that combine supervised fine-tuning with reinforcement learning based on group relative policy optimization (GRPO). GRPO is specifically designed to enhance reasoning abilities and reduce computational overhead by eliminating the need for an external “critic” model; instead, it evaluates groups of responses relative to one another. This means that the model can incrementally shift its reasoning toward better-rewarded outputs over time, without the need for large amounts of labeled data.
- Reward modeling: This trial-and-error approach to learning incentivizes the model toward answers that are both correct and well-reasoned. It does this by assigning feedback in the form of a “reward signal” when a task is completed, thus helping to inform how the reinforcement learning process can be further optimized.
- Cold-start data: DeepSeek-R1 uses “cold-start” data for training, which refers to a minimally labeled, high-quality, supervised dataset that “kickstarts” the model’s training so that it quickly attains a general understanding of tasks.
- Chain of thought: DeepSeek-R1 uses chain of thought (CoT) prompting to tackle reasoning tasks and perform self-evaluation. This simulates human-like reasoning by instructing the model to break down complex problems in a structured way, thus permitting it to logically deduce a coherent answer, and ultimately improving the readability of its answers.
- Rejection sampling: The model also uses rejection sampling for weeding out lower-quality data, which means that after generating different outputs, the model only selects those that meet specific criteria for further epochs of fine-tuning and training.
- Distillation: Using a curated dataset, DeepSeek-R1 has been distilled into smaller open versions that are relatively high-performing yet cheaper to run, most notably using Qwen and Llama architectures.

Via “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” research paper.
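The group-relative scoring, reward signal and rejection sampling steps described in the list above can be sketched roughly as follows. This is a toy illustration, not DeepSeek’s training code. The reward rule (a point for a correct final answer, half a point for reasoning inside `<think>` tags), the threshold and every function name are hypothetical. The one part taken from the GRPO formulation is the idea of normalizing each reward against the mean and standard deviation of its own group, so that no separate critic model is needed.

```python
import re
from statistics import mean, pstdev


def rule_based_reward(response, expected_answer):
    """Hypothetical toy reward: 1.0 for a correct final answer, plus 0.5
    if the response shows its reasoning inside <think>...</think> tags."""
    has_reasoning = re.search(r"<think>.+?</think>", response, flags=re.DOTALL) is not None
    final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    accuracy = 1.0 if final_answer == expected_answer else 0.0
    return accuracy + (0.5 if has_reasoning else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against its own group: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rejection_sample(responses, rewards, threshold):
    """Keep only the responses whose reward meets the threshold."""
    return [resp for resp, r in zip(responses, rewards) if r >= threshold]


if __name__ == "__main__":
    # A made-up group of sampled answers to the prompt "What is 2 + 2?"
    group = [
        "<think>2 + 2 = 4</think>4",
        "4",
        "<think>a guess</think>5",
        "5",
    ]
    rewards = [rule_based_reward(resp, "4") for resp in group]
    advantages = group_relative_advantages(rewards)
    for resp, reward, adv in zip(group, rewards, advantages):
        print(f"{reward:.1f}  {adv:+.2f}  {resp!r}")
    print("kept for fine-tuning:", rejection_sample(group, rewards, threshold=1.0))
```

Responses that beat their group’s average get a positive advantage and would be reinforced, while those below it get a negative one. The rejection sampling step simply filters the same group down to the outputs worth reusing as fine-tuning data.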
Potential Pitfalls
With any model, there are flaws that need to be balanced with the larger picture of performance and cost. According to AI security researchers at AppSOC and Cisco, here are some of the potential drawbacks to DeepSeek-R1, which suggest that robust third-party security and safety “guardrails” may be a wise addition when deploying this model.
- Security: DeepSeek-R1 could be vulnerable to prompt injection attacks, resulting in erroneous outputs and potentially compromised systems. When tested, DeepSeek-R1 showed that it may be capable of generating malware in the form of malicious scripts and code snippets.
- Safety: When DeepSeek-R1 was tested with jailbreaking techniques, its safety mechanisms were consistently bypassed, and it generated harmful or restricted content as well as responses with toxic or harmful wording, indicating that the model is vulnerable to algorithmic jailbreaking and potential misuse.
- Hallucinations: DeepSeek-R1 may be susceptible to generating false or fabricated answers.
Conclusion
Despite these shortcomings, DeepSeek-R1 demonstrates the potential power of the reward system underlying reinforcement learning when applied to LLMs.
During DeepSeek-R1’s training process, it became clear that rewarding accurate and coherent answers gave rise to nascent model behaviors like self-reflection, self-verification, long-chain reasoning and autonomous problem-solving. These behaviors point to the possibility of emergent reasoning that is learned over time rather than overtly taught, thus possibly paving the way for further breakthroughs in AI research.
