DeepSeek-R1 is a big step forward for the open model ecosystem, with DeepSeek's latest model competing with OpenAI's o1 on a variety of benchmarks. There is a lot of hype, and a lot of noise, around the fact that they achieved this with much less money and compute.
Instead of learning about it from AI influencer* threads hyping up the release, I decided to make a reading list that links to a lot of the fundamental research papers. This list is meant to be slowly digested one paper at a time with a cup of hot coffee or tea next to a cozy fireplace, not while scrolling social media.
* not you Andrej, we love your threads
If you have been keeping up with the field, R1 doesn't come as much of a surprise. It is the natural progression of the research, and it is amazing that they decided to spend all that compute just to give the model weights to the community for free.
We have already covered a bunch of these topics in our research paper club that gathers on Fridays over Zoom. We go deep and don't shy away from the math, but you will walk away having learned something; I try to break it down in the plainest language possible. If you want to join our learning journey, feel free to check out our events calendar below!
Transformers Papers
At its core, DeepSeek is built on the Transformer neural network architecture. If you aren't familiar with Transformers, I'd start with some of these foundational papers from Google, OpenAI, Meta, and Anthropic.
This paper introduced the Transformer architecture in the context of machine translation back in 2017, and kicked off the scaling-law trends that led to GPT-2, GPT-3, ChatGPT, and now the DeepSeek models.
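If you want a feel for the paper's core operation, here is a minimal sketch of scaled dot-product attention in PyTorch (just the formula from the paper written out, not the authors' reference code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) -- the attention formula from "Attention Is All You Need"
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ v                              # weighted sum of the values

q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)         # shape (1, 4, 8)
```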
This paper showed the generalization of larger-scale pre-training with a suite of models that today we would consider small. At the time this was a big deal: it showed that we no longer had to train specialized models for each task, and that this "unsupervised" learning approach could allow models to "multitask".
There is also the GPT-3 paper (Language Models are Few-Shot Learners), which introduces the idea of prompting LLMs. It mainly comments on how they scaled up the data and compute.
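"Few-shot" here just means putting worked examples in the prompt and letting the model continue the pattern, with no gradient updates. A toy illustration, loosely echoing the translation example in the paper:

```python
# In-context learning: the model infers the task from the examples in the prompt.
prompt = """Translate English to French.

sea otter => loutre de mer
cheese => fromage
plush giraffe =>"""
# Fed to a completion-style LLM, the expected continuation is "girafe en peluche".
```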
The InstructGPT paper shows how OpenAI went from a pre-trained GPT-3 model to a ChatGPT-like model. They don't explicitly call it ChatGPT in this paper, but if you read between the lines, this was either GPT-3.5 or ChatGPT. The core insight here was collecting data to train a reward model and using reinforcement learning to turn the raw pre-trained model into a useful chatbot that follows instructions.
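The reward model in that pipeline is typically trained on pairs of responses that humans have ranked, with a pairwise ranking loss. A minimal sketch of that loss (my paraphrase of the setup described in the InstructGPT paper; the scores here are placeholder tensors):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    # reward_chosen / reward_rejected: scalar scores the reward model assigns to the
    # human-preferred and human-rejected responses for the same prompt.
    # The loss pushes the preferred response to score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen_scores, rejected_scores = torch.randn(4), torch.randn(4)
loss = reward_model_loss(chosen_scores, rejected_scores)
```

The trained reward model then provides the scalar signal that the RL step optimizes against.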
The Llama-3 Herd of Models paper from Meta was the first big state of the art large language model release that competed with GPT-4. They released a 405B model and a suite of smaller models along with a technical report demystifying the inner workings of the training pipelines.
Anthropic's blog posts and papers are great for understanding the inner workings of Transformers. This series dives into the mechanisms that make a Transformer work, starting with the smallest possible "circuits" and working its way up. The posts are long and very detailed, but well worth the read.
Chain of Thought Reasoning Papers
DeepSeek's R1 and OpenAI's o1 both rely on internal "thought" tokens that contain the model's internal reasoning. This behavior can be prompted for and trained into a model. Using these extra tokens as a scratch pad, models have been shown to solve multi-step problems and tackle more complex tasks. The following papers are good background on how the Chain of Thought reasoning research has progressed over the past few years.
With prompting alone, this paper shows that you can get models to generate intermediate reasoning steps before coming to a final answer. This prompting improves model performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, surpassing the performance of the (at the time) state-of-the-art fine-tuned GPT-3 model.
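In practice, the prompt just includes an exemplar with its reasoning written out, so the model imitates the step-by-step format on the new question. A small illustration in the spirit of the arithmetic examples from the paper:

```python
# Few-shot chain-of-thought: the exemplar shows its reasoning, not just its answer.
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
# A capable model will typically continue with "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
```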
When a language model produces text left to right, token by token, it is hard to backtrack or correct course after a mistake. The Tree of Thoughts paper lets the model consider multiple possible reasoning paths while self-evaluating each choice to determine the next best action. It is a more expensive technique because it requires many generations and many verifications, but the model is able to solve three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.
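To make the search concrete, here is a rough breadth-first sketch of the idea, with hypothetical generate_thoughts and score_thought callbacks standing in for LLM calls (my paraphrase, not the authors' code):

```python
def tree_of_thoughts_bfs(problem, generate_thoughts, score_thought,
                         beam_width=3, max_depth=4):
    """Breadth-first search over partial reasoning paths.

    generate_thoughts(problem, path) -> list of candidate next steps (an LLM call)
    score_thought(problem, path)     -> float, the LLM's self-evaluation of a path
    """
    frontier = [[]]  # each element is a partial chain of thoughts
    for _ in range(max_depth):
        candidates = []
        for path in frontier:
            for thought in generate_thoughts(problem, path):
                candidates.append(path + [thought])
        # keep only the most promising partial paths, as judged by the model itself
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]  # the best full reasoning path found
```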
This paper has a good survey on different "of Thought" papers, as well as many other prompting techniques. You could collate all the prompts and techniques from this paper to create some very interesting synthetic datasets to further train better and better models on....just sayin'.
Mixture of Experts Papers
DeepSeek-V3 is what they call a "strong Mixture-of-Experts (MoE) language model" with 671B total parameters, 37B of which are activated for each token. GPT-4 had long been rumored to be a Mixture of Experts. The motivation behind these architectures is that different tokens require different levels of understanding, and by dividing the model into many experts, you can balance the number of active parameters against model capability and even get better performance than a fully dense model.
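The core mechanic is a learned router that sends each token to only a few expert MLPs, so just a fraction of the total parameters are active per token. A bare-bones PyTorch sketch with top-2 routing (none of DeepSeek's refinements like shared experts or their load-balancing strategy):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                             # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
y = layer(torch.randn(10, 64))   # only 2 of the 8 expert MLPs run for each token
```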
One of the early Mixture of Experts papers referred to the technique as "sharding" the model weights. They show that a giant model can be efficiently trained in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art. This lets you scale up the model's parameter count while maintaining performance in terms of both compute and accuracy.
The Switch Transformers paper trains what they refer to as a model with an outrageous number of parameters. They simplify the routing algorithm in the MoE to improve the stability of training large models and reduce computational cost.
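In terms of the toy MoELayer sketched above, the Switch simplification is essentially top-1 routing:

```python
# Switch-style routing: each token is processed by exactly one expert instead of two,
# which simplifies the routing logic and reduces compute and communication per token.
# (Reuses the toy MoELayer class from the MoE sketch above.)
switch_layer = MoELayer(top_k=1)
```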
This paper is a short and sweet dive into what Mistral did for its small 8x7B MoE. They match GPT-3.5-level performance and released the model weights under the Apache 2.0 license. I enjoy this paper's brevity and ease of reading.
Upcycling is an interesting technique from the team at NVIDIA. We also had the author on Arxiv Dives to chat about his work. The idea is taking a set of pre-trained dense models and combining them into a mixture of experts. I think there's a lot of exploration that can be done here in combining open-weight models and upcycling them into smarter models.
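A rough sketch of the idea, assuming hypothetical pre-trained dense FFN modules: seed each expert from a dense checkpoint, bolt on a freshly initialized router, and keep training from there (my simplification of the general pattern, not NVIDIA's exact recipe):

```python
import copy
import torch.nn as nn

def upcycle_to_moe(dense_ffns, d_model):
    """Upcycling sketch: each expert starts as a copy of a pre-trained dense FFN
    (copies of one checkpoint, or FFNs pulled from several different dense models),
    plus a freshly initialized router. Training then continues so the experts can
    diverge and specialize rather than learning from scratch."""
    experts = nn.ModuleList([copy.deepcopy(ffn) for ffn in dense_ffns])
    router = nn.Linear(d_model, len(experts))
    return experts, router
```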
Reinforcement Learning Papers
Reinforcement learning is the cherry on top of the cake, as Yann LeCun likes to say. This is what turns a pre-trained LLM into a chatbot with personality, tone, and utility. It also helps align the models with human preferences. This section will mainly touch on RL in the context of post-training LLMs, even though there is a ton of other research in this field.
This paper scales up the data pipeline for providing an LLM with feedback by removing the human from the loop. RLHF (RL from human feedback) is a reliable source of signal because humans give the feedback, but that data is expensive to collect. They show it's possible to get signal from an LLM acting as the reward model. This tees up other work on self-rewarding language models and eventually R1 and o1.
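Mechanically, this just means the judge is a model instead of a person. A hypothetical sketch, where llm is a stand-in for whatever strong model plays the judge (not the paper's actual prompt or setup):

```python
def ai_preference(llm, prompt, response_a, response_b):
    """Ask an LLM judge which response is better; the winner becomes the 'chosen'
    side of a preference pair, playing the role a human label plays in RLHF."""
    judge_prompt = (
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more helpful and harmless? Answer with 'A' or 'B'."
    )
    verdict = llm(judge_prompt).strip()
    return (response_a, response_b) if verdict.startswith("A") else (response_b, response_a)
```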
The first line of this abstract is a banger: "We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal."
In this paper they show that you don't need an external reward model: the same LLM can act as both the generator and the reward model. The idea is that if the same model weights learn both how to generate text and how to judge good and bad outputs, performance improves on both fronts. They set this model up in a loop and saw consistent improvement as the model judged itself and improved over 3 training cycles.
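As a rough sketch, one iteration of that loop looks something like this, where model.generate, model.judge, and dpo_train are hypothetical stand-ins for the paper's sampling, LLM-as-a-Judge scoring, and DPO training step:

```python
def self_rewarding_iteration(model, prompts, dpo_train, n_candidates=4):
    """One cycle of the self-rewarding loop: the same model generates candidate
    responses, scores them itself, and is then trained (e.g. with DPO) on the
    preference pairs it just created."""
    preference_pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        scores = [model.judge(prompt, c) for c in candidates]   # self-assigned rewards
        best = candidates[scores.index(max(scores))]
        worst = candidates[scores.index(min(scores))]
        preference_pairs.append((prompt, best, worst))
    return dpo_train(model, preference_pairs)   # the improved model for the next cycle
```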
The same team from Meta that did the Self-Rewarding Language Models paper above came back after o1 was released with a similar pipeline, this time incorporating Chain-of-Thought reasoning. They rolled this research out quickly after o1 and didn't release any models from it, but it is a very similar pipeline to what you would do to train an R1-style model.
I'm throwing the DPO paper into this section, even though there are many other signals one could use for reinforcement learning, such as PPO or the GRPO used in DeepSeek. DPO is the easiest to understand in my book, and it will give you a good jumping-off point for the other techniques.
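For reference, the DPO loss itself fits in a few lines. A minimal sketch given per-response log-probabilities (in practice you get these by summing token log-probs of each response under the policy and under a frozen reference copy of the model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Each argument is the total log-prob of a response (chosen = preferred,
    # rejected = dispreferred) under the policy or the frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Reward the policy for upweighting the chosen response more than the rejected
    # one, relative to the reference model -- no explicit reward model needed.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```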
DeepSeek Papers
Last but not least are the DeepSeek papers themselves. I thought I'd start with the non-DeepSeek papers to give you a baseline of understanding before diving into the "Deep" end. There is quite the progression of work that led up to the overnight success of R1, so I wouldn't sleep on any of the papers below.
This was the V1 of their base language model. Here DeepSeek explores the limits of scaling laws and follows the now well-established pattern of pre-training, supervised fine-tuning, and DPO to get to a final chat model.
DeepSeek 🤝 MoE, still with all your favorite SFT and RL at the end to get the final model. Here DeepSeek extends V1 into a Mixture of Experts, improving performance and reducing training costs by 42%. They are starting to heat up here.
This paper went a bit under-hyped compared to R1, maybe because it was released on December 26th and all the AI influencers were on Christmas break. This model is where we get the shocking figure of roughly $5 million to train instead of the $100 million other labs were reporting. They released their checkpoints as a present to the rest of the world, and achieved performance on par with models from many of the other frontier labs.
We finally have our o1 competitor, open source and free for all to download and try. Well, that is if you want to download 670GB of model weights and have a cluster of GPUs to run them on. Luckily, they also distilled a set of smaller models that can even run locally on a modern MacBook. These models are a promising step forward for open source and open models, and a great jumping-off point for people to create synthetic datasets and run SOTA models at home.
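If you want to poke at one of the distilled models, something like this works with the Hugging Face transformers library (the model id below is the distilled Qwen-7B checkpoint as published on the Hub; swap in whichever size fits your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # one of the distilled checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23? Think it through."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# The response should include the model's <think> ... </think> reasoning trace before the answer.
```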
In the R1 paper they mention that they use an algorithm called GRPO (Group Relative Policy Optimization) during the reinforcement learning stage. GRPO is actually introduced in this DeepSeekMath paper, where they improve a model's ability to reason through math problems. This paper is a sneaky MVP in the set of DeepSeek papers, and I would highly recommend it.
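The core idea in GRPO is to drop the separate value (critic) model: you sample a group of responses per prompt and compute each response's advantage relative to the rest of its group. A sketch of just that advantage computation (the full objective adds a PPO-style clipped ratio and a KL penalty to a reference model):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for a group of responses sampled from
    the same prompt. Each response's advantage is its reward normalized by the
    group's mean and std -- no learned value model required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 8 sampled answers to one math problem, reward = 1 if the final answer is correct
rewards = torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.])
advantages = grpo_advantages(rewards)  # correct answers get positive advantage
```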
A couple more DeepSeek papers that may have gotten lost in the mix are:
If this felt like a lot, don't worry, you're not alone. We gather on Fridays as a group to discuss papers like these and break them down in terms that anyone can understand or apply to their own work. We've already covered a lot of these papers and more in our past set of dives, which you can find on our website or YouTube.
Let us know if there is anything we missed or any papers that you want to go over in the future by joining our Discord.