Arxiv Dives

No Hype DeepSeek-R1 Reading List

Greg Schoeninger
Jan 30, 2025

DeepSeek-R1 is a big step forward for the open model ecosystem in AI, with DeepSeek's latest model competing with OpenAI's o1 on a variety of benchmarks. There is a lot of hype, and a lot of noise, around the fact that they achieved this with much less money and compute.

Source: https://x.com/karpathy/status/1872362712958906460

Instead of learning about it from AI influencer* threads hyping up the release, I decided to make a reading list that links to a lot of the fundamental research papers. This list is meant to be slowly digested one paper at a time with a cup of hot coffee or tea next to a cozy fireplace, not while scrolling social media.

* not you Andrej, we love your threads

If you have been keeping up with the field, R1 doesn't come as much of a surprise. It was the natural progression of the research, and it is amazing that they decided to spend all that compute just to give the model weights to the community for free.

We have already covered a bunch of these topics in our research paper club that gathers on Fridays over Zoom. We go deep and don't shy away from the math, but you will walk away having learned something. I try to break it down in as plain a manner as possible. If you want to join our learning journey, feel free to check out our events calendar below!

Oxen.ai · Events Calendar
View and subscribe to events from Oxen.ai on Luma. Build World-Class AI Datasets, Together. Track, iterate, collaborate on, & discover data in any format.

Transformers Papers

At its core, DeepSeek is built on a Transformer neural network architecture. If you aren't familiar with Transformers, I'd start with some of these foundational papers from Google, OpenAI, Meta, and Anthropic.

Attention Is All You Need

This paper introduced the Transformer architecture in the context of machine translation back in 2017, and kicked off the scaling-laws trend that led to GPT-2, GPT-3, ChatGPT, and now the DeepSeek models.

Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Language Models are Unsupervised Multitask Learners (GPT-2)

This paper showed the generalization of larger-scale pre-training with a suite of models that today we would consider small. At the time, this was a big deal: it showed that we no longer had to train specialized models for each task, and that this "unsupervised" learning approach could allow models to "multitask".

Link: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

There is also the GPT-3 paper (Language Models are Few-Shot Learners), which introduces the idea of prompting LLMs. That paper mainly covers how they scaled up the data and compute.

Training Language Models to Follow Instructions (InstructGPT)

The InstructGPT paper shows how OpenAI went from a pre-trained GPT-3 model to a ChatGPT-like model. They don't explicitly call it ChatGPT in this paper, but reading between the lines, this line of work became GPT-3.5 and ChatGPT. The core insight here was collecting data to train a reward model and using reinforcement learning to turn the raw pre-trained model into a useful chatbot that follows instructions.

Training language models to follow instructions with human feedback
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Llama-3 Herd Of Models

The Llama 3 Herd of Models paper from Meta was the first big open-weights large language model release that competed with GPT-4. They released a 405B model and a suite of smaller models, along with a technical report demystifying the inner workings of the training pipelines.

The Llama 3 Herd of Models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

A Mathematical Framework For Transformer Circuits

Anthropic's blog posts and papers are great for understanding the inner workings of Transformers. This one dives into the mechanisms that make a Transformer work, starting with the smallest possible "circuit" and working up from there. It is long and very detailed, but well worth the read.

A Mathematical Framework for Transformer Circuits

Chain of Thought Reasoning Papers

DeepSeek's R1 and OpenAI's o1 both rely on internal "thought" tokens that contain the model's internal reasoning. This behavior can be prompted for and trained into a model. Using these extra tokens as a scratch pad, models have been shown to solve multi-step problems and tackle more complex tasks. The following papers are good background on how the Chain of Thought reasoning research has progressed over the past few years.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper shows that, with prompting alone, you can get models to generate intermediate reasoning steps before coming to a final answer. The prompting improves model performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, surpassing the (at the time) state-of-the-art fine-tuned GPT-3 model.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
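
In practice the technique is nothing more than a prompt format. Here is a small sketch that assembles a one-shot chain-of-thought prompt, paraphrasing the well-known tennis-ball exemplar from the paper; the exact wording you use is up to you.

```python
# Build a one-shot chain-of-thought prompt: a worked example with its reasoning
# spelled out, followed by the new question in the same Q/A format.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n\n"
)
question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
)
prompt = exemplar + question + "A:"
print(prompt)  # send this to any sufficiently large LLM and it will tend to reason step by step
```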

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Because language models produce text left to right, token by token, it is hard for them to backtrack or correct course once they make a mistake. The Tree of Thoughts paper lets the model consider multiple possible reasoning paths while self-evaluating its choices to decide the next course of action. The technique is more expensive because it requires many generations and many verifications, but it shows the model can solve three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
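
As a rough sketch of the search loop (my own simplification, with stand-in functions where the paper would call an LLM to propose and to evaluate thoughts), the breadth-first variant of ToT looks something like this:

```python
from typing import Callable, List

def tree_of_thoughts_bfs(
    problem: str,
    propose: Callable[[str, str], List[str]],  # (problem, thoughts so far) -> candidate next thoughts
    score: Callable[[str, str], float],        # (problem, thoughts so far) -> self-evaluated value
    breadth: int = 3,
    depth: int = 3,
) -> str:
    """Breadth-first search over reasoning paths, keeping the `breadth` best at each step."""
    frontier = [""]  # start from an empty chain of thought
    for _ in range(depth):
        candidates = [(t + "\n" + nxt).strip() for t in frontier for nxt in propose(problem, t)]
        candidates.sort(key=lambda t: score(problem, t), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]

# Stand-in "LLM calls" so the sketch runs end to end; in practice both are model generations.
demo_propose = lambda problem, thoughts: ["try 13 - 9 = 4", "try 10 + 13 = 23"]
demo_score = lambda problem, thoughts: float(len(thoughts))  # pretend longer = more promising
print(tree_of_thoughts_bfs("Use 4, 9, 10, 13 to make 24", demo_propose, demo_score))
```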

The Prompt Report

This paper is a good survey of the different "of Thought" papers, as well as many other prompting techniques. You could collate all the prompts and techniques from this paper to create some very interesting synthetic datasets to further train better and better models on....just sayin'.

The Prompt Report: A Systematic Survey of Prompting Techniques
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.

Mixture of Experts Papers

DeepSeek-V3 is what they call a "strong Mixture-of-Experts (MoE) language model" with 671B total parameters, 37B of which are activated for each token. GPT-4 had long been rumored to be a Mixture of Experts. The motivation behind these architectures is that some tokens require different levels of understanding, and by dividing the model into many experts you can balance the number of active parameters against model capacity, and even get better performance than a fully dense model.
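
To make "activated parameters" concrete, here is a minimal top-k MoE layer in PyTorch with made-up sizes; real systems like DeepSeek-V3 add load-balancing strategies, capacity limits, and expert parallelism on top of this basic routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # choose top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize the chosen gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# 10 tokens flow through; only 2 of the 8 experts run for each one.
x = torch.randn(10, 64)
print(TinyMoE()(x).shape)  # torch.Size([10, 64])
```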

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

In one of the early Mixture of Experts papers, they refer to the technique as "sharding" the model weights. They show that a giant model can be trained efficiently in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art. This lets them scale up the model's parameter count while keeping compute per token, and accuracy, in check.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

The Switch Transformers paper trains what they refer to as a model with an outrageous number of parameters. They simplify the routing algorithm in the MoE to improve the stability of training large models and reduce computational cost.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model.

A Review of Sparse Expert Models in Deep Learning

MoEs are not new by any means; this paper is a good historical dive into what has been tried in the world of sparsity in deep learning models.

A Review of Sparse Expert Models in Deep Learning
Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.

Mixtral of Experts

This paper is a short and sweet dive into what Mistral did for its small 8x7B MoE. They matched GPT-3.5-level performance and released the model weights under the Apache 2.0 license. I enjoy this paper's brevity and ease of reading.

Mixtral of Experts
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Upcycling MoEs Beat Dense LLMs

Upcycling is an interesting technique from the team at Nvidia. We also had the author on Arxiv Dives to chat about his work. The idea is to take a pre-trained dense model and convert it into a mixture of experts, reusing the dense weights to initialize each expert. I think there's a lot of exploration that can be done here with combining open-weights models and upcycling them into smarter models.

Upcycling Large Language Models into Mixture of Experts
Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel “virtual group” initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
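
Conceptually, and very much as a hedged sketch rather than Nvidia's actual recipe, upcycling amounts to copying a dense model's feed-forward weights into every expert and adding a freshly initialized router, then continuing training:

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, d_model: int, n_experts: int = 8):
    """Turn one dense feed-forward block into the pieces of an MoE layer:
    each expert starts as an exact copy of the dense weights, and the router
    is new and has to be learned during continued training."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(n_experts)])
    router = nn.Linear(d_model, n_experts)
    return experts, router

# A hypothetical dense FFN block; in practice this comes from a pre-trained checkpoint.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
experts, router = upcycle_ffn(dense, d_model=64)
print(len(experts), "experts initialized from one dense block")
```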

Reinforcement Learning Papers

The cherry on top of the cake, as Yann LeCun likes to say. This is what turns a pre-trained LLM into a chatbot with personality, tone, and utility. It also helps align the models with human preferences. This section will mainly touch on RL in the context of post-training LLMs, even though there is a ton of other research in this field.

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

This paper scales up the data pipeline for providing an LLM with feedback by removing the human from the loop. RLHF (RL from human feedback) is a reliable source of signal because humans give the feedback, but the data is expensive to collect. They show it's possible to get signal from an LLM acting as the reward model. This tees up other work on self-rewarding language models, and eventually R1 and o1.

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards “self-improvement” by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
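
A hedged sketch of the data-collection side of RLAIF, with a placeholder judge function standing in for the off-the-shelf LLM (the prompt wording and function names are mine, not the paper's):

```python
from typing import List, Tuple

def ask_judge_llm(judge_prompt: str) -> str:
    # Placeholder for a call to an off-the-shelf LLM; a real pipeline hits a model API here.
    return "A"

def label_preferences(prompts: List[str], gen_a: List[str], gen_b: List[str]) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) triples using AI feedback instead of human raters."""
    dataset = []
    for prompt, a, b in zip(prompts, gen_a, gen_b):
        judge_prompt = (
            f"Task: {prompt}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
            "Which response is better? Answer with exactly 'A' or 'B'."
        )
        verdict = ask_judge_llm(judge_prompt)
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        dataset.append((prompt, chosen, rejected))
    return dataset  # usable for reward-model training or DPO, just like human preference data

pairs = label_preferences(["Summarize: ..."], ["short summary"], ["a much longer summary"])
print(pairs[0][1])  # the AI-preferred response
```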

Self-Rewarding Language Models

The first line of this abstract is a banger: "We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal."

In this paper they show that instead of relying on an external reward model, you can use the same LLM as both the generator and the reward model. The idea is that if the same model weights learn both how to generate text and how to judge good and bad outputs, performance improves on both fronts. They set this model up in a loop and saw consistent improvement over three training iterations of the model judging and improving itself.

Self-Rewarding Language Models
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
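
At a high level the loop looks something like the sketch below; the generation, judging, and training steps are stubbed out as callables since this is my own simplification, not the paper's code. The key point is that the same `model` appears in all three roles.

```python
from typing import Callable, List, Tuple

def self_rewarding_iteration(
    model,
    prompts: List[str],
    generate: Callable,      # (model, prompt) -> response
    judge_score: Callable,   # (model, prompt, response) -> score, via LLM-as-a-Judge prompting
    dpo_train: Callable,     # (model, preference_pairs) -> updated model
    n_samples: int = 4,
):
    """One iteration: the same model generates candidates, judges them, and is trained on the result."""
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: judge_score(model, prompt, c))
        pairs.append((prompt, ranked[-1], ranked[0]))  # best vs. worst become a preference pair
    return dpo_train(model, pairs)

# The paper runs roughly three of these iterations back to back:
# model = self_rewarding_iteration(model, prompts, generate, judge_score, dpo_train)
```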

Thinking LLMs: General Instruction Following with Thought Generation

The same team at Meta that wrote the Self-Rewarding Language Models paper above came back shortly after o1 was released with a similar pipeline, this time incorporating Chain-of-Thought reasoning. They didn't release any models from this work, but it is a very similar pipeline to what you would do to train an R1-style model.

Thinking LLMs: General Instruction Following with Thought Generation
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning -- but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks.

DPO - Direct Preference Optimization

I'm throwing the DPO paper into this section, even though there are many other optimization algorithms one could use for reinforcement learning, such as PPO or the GRPO used in DeepSeek-R1. DPO is the easiest to understand in my book, and will give you a good jumping-off point for the other techniques.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
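
The whole method boils down to a single classification-style loss. Here is a minimal sketch in PyTorch, where the log-probabilities of the chosen and rejected responses would come from the policy being trained and from a frozen reference copy of it (tensor names are mine):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: push the policy to prefer 'chosen' over 'rejected' more strongly than the reference does."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch: summed log-probabilities for 4 preference pairs.
p_c, p_r = torch.randn(4), torch.randn(4)
r_c, r_r = torch.randn(4), torch.randn(4)
print(dpo_loss(p_c, p_r, r_c, r_r))
```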

DeepSeek Papers

Last but not least are the DeepSeek papers themselves. I thought I'd start with the non-DeepSeek papers to give you a baseline of understanding before diving into the "Deep" end. There is quite a progression of work that led up to the overnight success of R1, so I wouldn't sleep on any of the papers below.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

This was the V1 of their base language model. Here DeepSeek explores the limits of scaling laws and follows the now well-established pattern of pre-training, supervised fine-tuning, and DPO to get to a final chat model.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek 🤝 MoE, still with all your favorite SFT and RL at the end to get the final model. Here DeepSeek extends V1 into a Mixture of Experts, improving performance and reducing training costs by 42.5%. They are starting to heat up here.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

DeepSeek-V3 Technical Report

This paper went a bit under-hyped compared to R1, maybe because it was released on December 26th and all the AI influencers were on Christmas break. This model is where we get the shocking figure of roughly $5.5 million to train (2.788M H800 GPU hours) instead of the $100 million other labs were reporting. They released their checkpoints as a present to the rest of the world, and achieved performance on par with a lot of the other frontier labs.

DeepSeek-V3 Technical Report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

We finally have our o1 competitor, open source and free for all to download and try. Well, that is, if you want to download 670GB of model weights and have a cluster of GPUs to run them on. Luckily, they also distilled a set of smaller models that can even run locally on a modern MacBook. These models are a promising step forward for open source and open models, and a great jumping-off point for people to create synthetic datasets and run SOTA models at home.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
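
If you want to kick the tires on one of the distilled models, a minimal sketch with the Hugging Face transformers library looks roughly like this; double-check the exact model ids and licenses on DeepSeek's hub page before relying on them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # one of the distilled checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24? Think it through."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
# The response includes the model's <think> ... </think> reasoning before the final answer.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```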

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

In the R1 paper they mention that they use an algorithm called GRPO during the reinforcement learning stage. GRPO is actually introduced in this DeepSeekMath paper, where they improve a model's ability to reason through math problems. This paper is a sneaky MVP in the set of DeepSeek papers, and I would highly recommend it.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
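
The core trick in GRPO is to drop PPO's learned value network and instead compute each sample's advantage relative to a group of completions drawn for the same prompt. Here is a sketch of just that advantage computation (the clipped policy-gradient update and KL penalty that sit on top are omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, group_size) -- one scalar reward per sampled completion.
    Instead of a learned value baseline (as in PPO), each reward is normalized
    against the other samples drawn for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled completions each, scored by a rule-based or learned reward.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.1]])
print(grpo_advantages(rewards))  # samples above their group mean get positive advantage
```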

A couple more DeepSeek papers that may have gotten lost in the mix are:

Want to Learn More?

If this felt like a lot, don't worry, you're not alone. We gather on Fridays as a group to discuss papers like this and break them down in terms that anyone can understand or apply to their own work. We've already covered a lot of these papers and more in our past set of dives which you can find on our website or YouTube.

Community - Arxiv Dives | Oxen.ai
Manage your AI data using Oxen’s state of the art data version control. Blazing fast, and Open source.
Oxen
Oxen.ai is wicked fast versioning and collaboration tools for data. Even millions of unstructured images, we quickly handle any type of data so you can build cutting-edge AI. Arxiv Dives: Each week we dive deep into a topic in machine learning or general artificial intelligence research. The sessions are live with a group of smart Oxen every Friday. Create an account: www.oxen.ai and join the discussion: https://lu.ma/oxen

Let us know if there is anything we missed or any papers that you want to go over in the future by joining our Discord.

Join the oxen Discord Server!
Check out the oxen community on Discord - hang out with 1240 other members and enjoy free voice and text chat.