DeepSeek-AI introduces DeepSeek-R1-Zero and DeepSeek-R1, two reasoning-focused large language models (LLMs) developed using reinforcement learning (RL). DeepSeek-R1-Zero was trained purely with RL, while DeepSeek-R1 incorporates cold-start fine-tuning and multi-stage training to improve readability and reasoning performance.

Key Contributions
1. Reinforcement Learning-Driven Reasoning
- DeepSeek-R1-Zero is trained via large-scale RL without supervised fine-tuning (SFT).
- Achieves strong reasoning performance but suffers from poor readability and language mixing.
- DeepSeek-R1 adds cold-start data and SFT stages to address these issues.
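The reasoning RL in these models is driven by rule-based rewards rather than a learned reward model: an accuracy reward checks the final answer and a format reward checks the reasoning template. A minimal sketch, assuming the paper's `<think>`/`<answer>` output template and simple exact-match answer checking (the actual checkers are task-specific, e.g. for boxed math answers):

```python
import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion wraps its reasoning in <think> tags and its
    # final answer in <answer> tags, as the R1-Zero template requires.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Rule-based check: extract the text inside <answer> and compare it
    # to the reference answer (exact match here, for illustration).
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # The two signals are simply summed in this sketch.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```

Because both signals are deterministic rules, there is no reward model to train and no reward-model hacking to guard against.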
2. Benchmark Performance
DeepSeek-R1 achieves state-of-the-art results on reasoning-heavy tasks:
- Math: 97.3% on MATH-500, on par with OpenAI-o1-1217.
- Coding: 96.3rd percentile rating on Codeforces, outperforming most human competitors.
- General knowledge: 90.8% on MMLU, surpassing many open-source models.
3. Model Distillation for Smaller Models
- Knowledge from DeepSeek-R1 is distilled into smaller models (1.5B to 70B parameters).
- Distilled 32B and 70B models outperform QwQ-32B-Preview, setting new state-of-the-art results among open-source dense models.
Training Approach
1. DeepSeek-R1-Zero (Pure RL)
- No SFT used.
- Exhibits emergent reasoning behaviors such as self-verification, reflection, and long chains of thought.
- Suffers from poor readability and frequent language mixing.
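The RL algorithm behind this pure-RL training is GRPO (Group Relative Policy Optimization), which drops the critic network of PPO: for each prompt, a group of outputs is sampled, and each output's advantage is its reward normalized against the rest of the group. A minimal sketch of that group-relative advantage computation (the full objective also includes a clipped policy ratio and a KL penalty, omitted here):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Normalize each sampled output's reward against the other outputs
    # drawn for the same prompt: A_i = (r_i - mean(r)) / std(r).
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0:
        # All outputs scored the same: no signal, zero advantage.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Outputs that beat their group's average get positive advantage and are reinforced; below-average outputs are suppressed, all without a separate value model.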
2. DeepSeek-R1 (Cold-Start + RL)
- Cold-start fine-tuning with curated data.
- Reinforcement learning for reasoning improvement.
- Supervised fine-tuning with diverse datasets.
- Final RL tuning for alignment with human preferences.
3. Distillation to Smaller Models
- Fine-tuning smaller models (Qwen, Llama series) with 800K reasoning samples.
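Distillation here is plain supervised fine-tuning: the ~800K samples generated by DeepSeek-R1 become prompt/response training pairs for the student models, with no RL stage applied to them. A minimal sketch of assembling such a dataset (the field names are hypothetical, not the paper's format):

```python
def build_distillation_dataset(teacher_samples: list[dict]) -> list[dict]:
    # Each teacher-generated sample pairs a question with the teacher's
    # full reasoning trace plus final answer; the student is then
    # fine-tuned on these targets with ordinary next-token SFT.
    dataset = []
    for s in teacher_samples:
        dataset.append({
            "prompt": s["question"],
            "target": f"<think>{s['reasoning']}</think>\n{s['answer']}",
        })
    return dataset
```

Keeping the reasoning trace in the target is the point of the exercise: the student learns to imitate the teacher's chain of thought, not just its final answers.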
Findings & Limitations
- Large-scale RL applied directly to smaller models underperforms; distilling from a stronger reasoning model works better.
- Limited gains on software-engineering tasks, where long evaluation times make large-scale RL inefficient.
- Prompt sensitivity: few-shot prompting degrades performance; zero-shot prompts with a direct problem description work best.
- Multilingual limitations: the model is optimized for English and Chinese, and language mixing can occur on queries in other languages.
Future Directions
- Improve general capabilities like function calling and role-playing.
- Enhance language consistency across multiple languages.
- Optimize reinforcement learning for software engineering tasks.