DeepSeek-R1: A Primer

💡

This article tries to understand and explain the engineering behind DeepSeek-R1. It is based on the paper available at https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

In the past week, a Chinese LLM named DeepSeek-R1 has taken the AI world by storm.

It’s powerful, cost-effective, and—wait for it—open source.

While the world debates whether to trust it, I dug into its technical paper to find out how it’s reshaping the AI game.

DeepSeek-R1-Zero

It all started with this model. The folks at DeepSeek skipped the usual homework (aka supervised fine-tuning) and dove straight into the deep end of reinforcement learning (RL). They used something called Group Relative Policy Optimization (GRPO), which is a fancy way of saying they cut down on the heavy machinery (there is no separate critic model to train) and let the AI figure things out by itself, hopefully more efficiently.

With GRPO, DeepSeek-R1-Zero improved its reasoning by comparing groups of possible answers to the same question and nudging itself toward the better ones.

On top of that, every time the model correctly solved a math problem or a coding task, it got a virtual treat (a numerical reward). I know, this is the scary bit. If it gets treats and values them, does that mean some kind of awareness and preference?

As DeepSeek-R1-Zero continued its training, the model began developing reasoning strategies all by itself. It started to verify its steps, reconsider answers, and even exhibit "aha moments" where it re-evaluated its approach. In short, this model learned to think more like a human solving a puzzle rather than a machine following a script.

As much as it reads like magic, it was not. Model Zero was not exactly reader-friendly; it was like having a genius friend who could solve any problem for you, but whose explanations were so complicated that they just did not make sense. That may work in certain environments such as a lab, but it is not ideal for everyday use.

To fix this, the scientists at DeepSeek gave the model what they describe as a “cold start”. Instead of sending the AI to explore entirely on its own, they first gave it a set of well-structured, human-readable examples, known as long chain-of-thought (CoT) data. This dataset was designed to help the model produce cleaner, clearer explanations.

If I had to give an analogy for starting with CoT data, it is like training for a triathlon: if you are already a swimmer, a cyclist, or a runner, you have an advantage over someone who is none of those.

They formatted responses with a special token system, ensuring the model’s reasoning was digestible. Now, instead of garbled text, you get structured thoughts with clear summaries.

They also helped their model better mimic how humans solve complex problems by giving the model examples of step-by-step reasoning. After that, their model could solve problems by taking things one logical step at a time just like humans do.

With this initial fine-tuning, they let their model loose again on RL but this time it was different. Instead of starting from scratch, their model now could start from a more stable, human-friendly baseline.

DeepSeek-R1 was born.

DeepSeek-R1

After the success of DeepSeek-R1-Zero, the team wanted to refine the model further to make it more practical and user-friendly, so they decided to give it a bit of a head start, which they call a cold start.

The cold start involved gathering thousands of examples in which the model was encouraged to think out loud, essentially narrating its problem-solving steps.

Then, after the next round of RL, the scientists combed through the model’s responses and picked the best ones (a technique known as rejection sampling) to use as further fine-tuning data and refine its thinking even more.
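The "pick the best ones" step is usually called rejection sampling, and it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate` and `score` are hypothetical stand-ins for sampling from the model and for whatever quality check is applied, and the sample count and threshold are made-up values, not numbers from the paper.

```python
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: draws one response from the model
    score: Callable[[str, str], float],  # hypothetical: rates a (prompt, response) pair
    num_samples: int = 8,                # assumed value, for illustration only
    threshold: float = 1.0,              # assumed value, for illustration only
) -> List[str]:
    """Draw several responses for one prompt and keep only the ones that score well."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [c for c in candidates if score(prompt, c) >= threshold]


# Toy usage: a canned "model" that cycles through fixed answers.
canned_answers = iter(["4", "5", "4", "22"])
kept = rejection_sample(
    "What is 2 + 2?",
    generate=lambda prompt: next(canned_answers),
    score=lambda prompt, response: 1.0 if response == "4" else 0.0,
    num_samples=4,
)
```

The surviving responses would then be used as supervised fine-tuning examples, so the model is trained on its own best work.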

As a result, DeepSeek-R1 emerged as a model that didn’t just throw out solutions—it laid out the reasoning behind them, step by step. Not only was it solving problems, but it was doing so in a way that we humans could actually understand and follow along.

The scientists at DeepSeek also wanted to make this model accessible to people who do not have supercomputers lying around.

They applied a technique called distillation, which shrinks a large model’s capabilities down into a smaller model without losing too much of its power. These distilled models didn’t need to go through the exhaustive RL training themselves. They learned by watching their big brother.

The result was astonishing as they found that these smaller models performed exceptionally well - sometimes even better than much larger models in the field.

DeepSeek claims its training costs are a fraction of those of industry-leading models like GPT-4. While OpenAI reportedly spends tens of millions of dollars on training, DeepSeek’s cost-efficient techniques like GRPO and distillation make high-performance AI more accessible.

Not only that, their model developed reflective behaviors, like rethinking its approach when it hit a snag. This self-reflection is remarkable because it wasn’t programmed explicitly—it emerged as a natural byproduct of RL rewards.

This raises fascinating questions: can such behaviors evolve further? Could this lead to even more autonomous reasoning?

There were some innovative ideas in their approach to training DeepSeek-R1.

Reward Engineering: The model's behavior was shaped by carefully designed rewards, focusing on reasoning accuracy and language clarity. For DeepSeek-R1-Zero they relied on rule-based rewards rather than neural reward models, because neural reward models are prone to “reward hacking.”
Reinforcement Learning (RL): They used a scalable RL framework (GRPO) that cut down on computational costs while ensuring the model kept learning effectively.
Emergent Behaviors: DeepSeek-R1-Zero showcased “self-reflection”—a behavior that wasn’t programmed but emerged naturally. This was a direct result of incentivizing correct reasoning rather than rote answers.
Distillation vs. Direct Training: Applying RL directly on smaller models was expensive and less effective. Distillation from a larger, RL-trained model was very efficient and a lot cheaper.
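To make the reward engineering item above more concrete, here is a minimal sketch of what a rule-based reward could look like. It is not DeepSeek's code: the function names, the exact-match answer check, and the equal weighting of the two rewards are my assumptions. The `<think>` tags follow the output format the paper describes for DeepSeek-R1-Zero.

```python
import re

# Assumed response shape: reasoning inside <think>...</think>, then the final answer.
RESPONSE_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the expected structure, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, expected_answer: str) -> float:
    """1.0 if the text after the reasoning block equals the expected answer exactly."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == expected_answer.strip() else 0.0


def total_reward(response: str, expected_answer: str) -> float:
    """Sum of both rewards; equal weighting is an assumption for illustration."""
    return accuracy_reward(response, expected_answer) + format_reward(response)


good = total_reward("<think>2 + 2 is 4</think> 4", "4")      # right answer, right format
wrong = total_reward("<think>maybe 5?</think> 5", "4")       # right format, wrong answer
unstructured = total_reward("4", "4")                        # no reasoning block at all
```

Because both checks are plain rules, there is no learned reward model for the policy to exploit, which is the point of the design.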

Where Do We Go From Here?

As impressive as DeepSeek-R1 is, it is not without its quirks and there is a lot of room for improvement.

For example, it still has some issues when handling multiple languages. Think of a bilingual friend who sometimes forgets which language they are speaking.

It is sensitive to prompts and can easily be thrown off by few-shot prompts (where it’s given a few examples before tackling a task).

It is also not that great with complex software engineering tasks. Future iterations may need more targeted data in this domain.

DeepSeek isn’t just powerful—it’s open-source. While many cutting-edge LLMs are locked behind corporate walls, DeepSeek puts advanced AI into the hands of anyone with enough technical know-how. This could democratize AI development in ways we haven’t seen before, empowering researchers and small companies worldwide to build innovations that would have previously been out of reach.

No doubt a future DeepSeek-R2 will improve on R1, and DeepSeek has already released other models, including ones capable of generating images.

Is this the AGI we have all been waiting for? I don’t think so. We are not there yet, but right now, as the current techniques for producing and training LLMs are getting more and more expensive, DeepSeek has shown us that we can do better with good, creative engineering and out-of-the-box thinking.

We may be a while off from AGI yet, but there is no reason why we cannot make our existing models more efficient and cheaper.

Appendix 1: How does Group Relative Policy Optimization (GRPO) work?

Traditional actor-critic reinforcement learning (RL), such as PPO, uses two models:

The actor model, which generates outputs (e.g., answers or solutions).
The critic model, which estimates how good those outputs are and provides the baseline that rewards are compared against.

Training a separate critic is computationally expensive and can lead to instability.

GRPO skips the critic model by directly comparing a group of outputs produced by the actor model for the same input prompt. Each output is scored by a reward function (e.g., correctness, clarity, or consistency), and its score is then judged relative to the rest of the group. The model learns to increase the likelihood of outputs that score above the group average and decrease the likelihood of those that score below it.

GRPO still needs a reward function to guide training. These rewards are designed to encourage the model to improve specific behaviors, like reasoning clarity or task accuracy.

GRPO adjusts the model by tweaking its output probabilities using these group comparisons. If an output ranks higher, the model learns to generate similar outputs more often.
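The group comparison can be sketched as follows. As I understand the paper, each output's reward is normalized by the mean and standard deviation of its group, and the result is used as the "advantage" that says how strongly to push that output up or down. The function name, the use of the population standard deviation, and the small `eps` guard are my own choices for this illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each output relative to the other outputs for the same prompt.

    Positive values mean "better than the group average" (make it more likely),
    negative values mean "worse than average" (make it less likely).
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population standard deviation
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every answer gets the same reward, nobody stands out and nothing is reinforced.
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

The group average plays the role the critic's estimate would otherwise play, which is why no second model is needed. The full GRPO objective also clips the size of each update and adds a penalty for drifting too far from a reference model; neither is shown here.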

Sure, the critic model works, but it’s expensive and tricky to train. It’s like hiring a professional food critic to evaluate every cake you bake—effective, but hardly practical. GRPO takes a simpler approach: instead of relying on a critic, the model evaluates itself by comparing groups of outputs it generates for the same input. It assigns rankings to these outputs (based on a reward function like clarity or accuracy) and tweaks its probabilities to favor the higher-ranking outputs.

Why is this better? No critic means less computational overhead and fewer training headaches. It’s faster, cheaper, and works well for language tasks where outputs can naturally be ranked.

Think of it as baking three cakes and asking your friends to pick the best one. You refine your recipe based on their feedback—no need to hire a Michelin-starred chef.

So, naturally, this raises the question: why didn’t the others use this technique before? Is it something DeepSeek invented?

The idea of comparing sampled outputs instead of training a critic is not entirely new, but there was a traditional RL bias. RL was mainly used in robotics and gaming, and that’s where AI science got its ideas from. Also, earlier AI companies had sponsors with deep pockets, so efficiency was not a priority; making it work was. Only now are large language models advanced enough to warrant this kind of approach.

I am not a GRPO expert or anything, but it is also worth noting that GRPO in this specific form comes from DeepSeek itself: it was introduced in their earlier DeepSeekMath paper. Since their focus was on being open source and able to run locally, they looked outside the box and came up with a technique that met the challenges they were facing.

This new approach, I think, will definitely change the AI investment landscape. We humans needed to come up with a solution like this, as we cannot keep throwing big money at developing LLMs, which, in a way, promotes complacency. Why hadn’t others done this before? Well, I guess when you have a budget of billions of dollars, you don’t have to innovate.

Appendix 2: Distilling explained

The alternative is to train smaller models directly with RL on limited data. While this can work, smaller models struggle to match the reasoning power of larger models because they lack the "headspace" to explore complex reasoning patterns that emerge during training.

Instead of directly training smaller models with RL, the researchers trained a large, high-capacity model (DeepSeek-R1) on a massive dataset using their multi-stage RL pipeline. This model learned sophisticated reasoning strategies.

Then, they distilled this larger model's knowledge into smaller ones by fine-tuning the smaller models on outputs generated by DeepSeek-R1. In other words, the larger model does the heavy lifting, learning advanced reasoning from rich datasets, and the smaller models are "taught" these patterns through supervised fine-tuning.

By distilling the learned reasoning patterns, the smaller models inherit the intelligence of their larger counterpart, outperforming small models trained independently with RL. In short, others teach smaller models directly, which limits what they can learn, while DeepSeek's approach is to teach the big model everything first, then let the smaller models copy its homework.
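The data-preparation half of distillation can be sketched like this. It is a simplified illustration, not DeepSeek's pipeline: `teacher_generate` is a hypothetical stand-in for calling the large model, `toy_teacher` is a fake that just returns a canned string, and the `prompt`/`completion` record layout is an assumed format for supervised fine-tuning data.

```python
from typing import Callable, Dict, List


def build_sft_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],  # hypothetical: calls the large teacher model
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's full response (reasoning plus answer)."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]


def toy_teacher(prompt: str) -> str:
    """Fake teacher used only to make the example runnable."""
    return f"<think>Working through: {prompt}</think> done"


dataset = build_sft_dataset(["What is 2 + 2?", "Is 7 prime?"], toy_teacher)
```

The smaller model is then fine-tuned to reproduce each `completion` given its `prompt` using ordinary supervised training, with no RL loop involved for the student.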

Appendix 3: Censorship or Being Politically Correct

Touchy a subject as it is, we also need to address the elephant in the room.

One of the criticisms of AI models is that they are sometimes too politically correct, and DeepSeek is no exception. When it first came out, because it is a product from China, people asked it various questions about China’s history that the model refused to answer.

In my opinion, this just shows that we need to keep in mind that these models are limited by their dataset (they do not come up with solutions out of thin air) and are as biased as their programmers/developers. Until models can independently explore, learn, and verify information in real-world contexts, their "knowledge" will always be a reflection of the curated (and often biased or incomplete) datasets they’re trained on.

DeepSeek-R1 shows that with creative engineering, we can build smarter, cheaper models. However, as we marvel at these advancements, let’s not forget: these models reflect their creators and the data they consume. Our next challenge is to figure out the steps we should take to ensure they grow into responsible, unbiased thinkers.

One way forward might be greater transparency in refusals, where models explain why they can’t answer. This approach would balance safety with trustworthiness.

It may not be enough, but it would be a start.
