DeepSeek-R1: Enhancing LLM Reasoning via Reinforcement Learning



Key Contributions

1. Reinforcement Learning-Driven Reasoning

  • DeepSeek-R1-Zero is trained via large-scale RL without supervised fine-tuning (SFT).
  • Achieves competitive reasoning performance but struggles with readability and language mixing.
  • DeepSeek-R1 addresses these issues by adding cold-start data and a multi-stage SFT + RL pipeline.
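The RL stage uses Group Relative Policy Optimization (GRPO), which samples a group of answers per prompt and scores each reward relative to the group, avoiding a separate value model. A minimal sketch of that advantage computation (simplified; the full GRPO objective also includes a clipped probability ratio and a KL penalty):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's
    reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all rewards equal -> no learning signal in this group
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

For example, `grpo_advantages([1.0, 0.0, 0.0, 1.0])` yields `[1.0, -1.0, -1.0, 1.0]`: correct answers are pushed up, incorrect ones down, with no learned critic involved.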

2. Benchmark Performance

DeepSeek-R1 achieves state-of-the-art results on reasoning-heavy tasks:

  • Math: 97.3% on MATH-500 (better than OpenAI-o1-1217).
  • Coding: 96.3rd percentile among human competitors on Codeforces.
  • General knowledge: 90.8% on MMLU, surpassing many open-source models.

3. Model Distillation for Smaller Models

  • Knowledge from DeepSeek-R1 is distilled into smaller models (1.5B to 70B parameters).
  • The distilled 14B model outperforms QwQ-32B-Preview, and the distilled 32B and 70B models set new records among dense models.


Training Approach

1. DeepSeek-R1-Zero (Pure RL)

  • No SFT used.
  • Exhibits emergent reasoning behaviors.
  • Poor readability & language consistency.
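R1-Zero's reward is rule-based rather than a learned reward model: a format reward checks that the output wraps its reasoning in `<think>` tags followed by an `<answer>` block, and an accuracy reward checks the final answer. A rough sketch of both (the paper's actual matching rules, e.g. for boxed math answers or compiled code, are more involved):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion is <think>...</think> followed by
    <answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside <answer> exactly matches the reference."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0
```

Because both rewards are deterministic checks, they are cheap to compute at RL scale and hard for the policy to game the way a neural reward model can be.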

2. DeepSeek-R1 (Cold-Start + RL)

  • Cold-start fine-tuning with curated data.
  • Reinforcement learning for reasoning improvement.
  • Supervised fine-tuning with diverse datasets.
  • Final RL tuning for alignment with human preferences.
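The four stages above run strictly in sequence, each starting from the previous stage's checkpoint. A toy sketch of that orchestration (the stage functions are hypothetical stand-ins; in practice each is a full training run):

```python
# Placeholder stages: each real stage would update model weights;
# here the "model" is just a list recording which stages have run.
def cold_start_sft(m):  return m + ["cold_start_sft"]   # 1. curated cold-start data
def reasoning_rl(m):    return m + ["reasoning_rl"]     # 2. reasoning-oriented RL
def diverse_sft(m):     return m + ["diverse_sft"]      # 3. SFT on diverse data
def preference_rl(m):   return m + ["preference_rl"]    # 4. human-preference RL

def train_r1(base_model):
    """Apply the four DeepSeek-R1 training stages in order."""
    model = base_model
    for stage in (cold_start_sft, reasoning_rl, diverse_sft, preference_rl):
        model = stage(model)
    return model
```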

3. Distillation to Smaller Models

  • Smaller models (Qwen and Llama series) are fine-tuned on ~800K reasoning samples generated by DeepSeek-R1; no RL stage is applied to the distilled models.
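Distillation here is plain supervised fine-tuning on the teacher's reasoning traces. One standard detail, sketched below under the assumption of a causal-LM cross-entropy loss, is masking the prompt tokens so the loss covers only the teacher-generated completion (`-100` is the ignore index used by PyTorch's cross-entropy; the `tokenize` callable is a stand-in for a real tokenizer):

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def distill_labels(prompt_ids, completion_ids):
    """Build SFT labels for one distillation sample: the student is
    trained to reproduce only the teacher's completion tokens."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)

def build_example(tokenize, question, teacher_trace):
    """Pair input ids with masked labels for a (question, trace) sample.
    `tokenize` is any callable mapping text to a list of token ids."""
    prompt_ids = tokenize(f"Question: {question}\nAnswer:")
    completion_ids = tokenize(teacher_trace)
    return {
        "input_ids": prompt_ids + completion_ids,
        "labels": distill_labels(prompt_ids, completion_ids),
    }
```

Training the student on full chain-of-thought traces, not just final answers, is what transfers the larger model's reasoning patterns.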


Findings & Limitations

  • RL alone is less effective for smaller models; distilling reasoning from a larger model works better.
  • Limited gains on software-engineering tasks, where long evaluation times make RL inefficient.
  • Prompt sensitivity: few-shot prompts degrade performance; zero-shot prompting with a direct problem description works best.
  • Multilingual limitations: language mixing persists for languages other than English and Chinese.


Future Directions

  • Improve general capabilities like function calling and role-playing.
  • Enhance language consistency across multiple languages.
  • Optimize reinforcement learning for software engineering tasks.