Model Library
Browse and deploy state-of-the-art AI models through the DevUp Gateway.
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, both of which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.

We pre-train DeepSeek-V3 on 14.8 trillion diverse, high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. Its training process is also remarkably stable: throughout the entire run, we did not experience any irrecoverable loss spikes or perform any rollbacks.

| Model | Total Params | Activated Params | Context Length | Download |
|---|---|---|---|---|
| DeepSeek-V3-Base | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-V3 | 671B | 37B | 128K | 🤗 HuggingFace |

NOTE: The total size of DeepSeek-V3 models on HuggingFace is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.
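The parameter counts quoted above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the figures are the rounded, billion-scale values from this page rather than exact tensor counts.

```python
# Rounded parameter counts, in billions, as quoted on this page.
MAIN_MODEL_PARAMS_B = 671   # Main Model weights
MTP_MODULE_PARAMS_B = 14    # Multi-Token Prediction (MTP) Module weights
ACTIVATED_PARAMS_B = 37     # parameters activated for each token


def share(part_b: float, whole_b: float) -> float:
    """Return part/whole as a fraction, rejecting a non-positive whole."""
    if whole_b <= 0:
        raise ValueError("whole_b must be positive")
    return part_b / whole_b


# Size of the HuggingFace release: Main Model plus MTP Module.
checkpoint_params_b = MAIN_MODEL_PARAMS_B + MTP_MODULE_PARAMS_B

# Fraction of the Main Model that is active for any single token.
activated_share = share(ACTIVATED_PARAMS_B, MAIN_MODEL_PARAMS_B)

# Fraction of the release taken up by the MTP Module.
mtp_share = share(MTP_MODULE_PARAMS_B, checkpoint_params_b)

print(f"release size: {checkpoint_params_b}B")         # release size: 685B
print(f"activated per token: {activated_share:.1%}")   # activated per token: 5.5%
print(f"MTP share of release: {mtp_share:.1%}")        # MTP share of release: 2.0%
```

Roughly one parameter in eighteen takes part in each token's forward pass, which is why the table lists total and activated parameters separately.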
To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: How to Run Locally.

For developers looking to dive deeper, we recommend exploring README_WEIGHTS.md for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback.

| Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 |
|---|---|---|---|---|---|
| Architecture (-) | - | MoE | Dense | Dense | MoE |
| # Activated Params (-) | - | 21B | 72B | 405B | 37B |
| # Total Params (-) | - | 236B | 72B | 405B | 671B |
| English Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| English BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| English MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| English MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
| English MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
| English DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
| English ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
| English ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
| English HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
| English PIQA (Acc.) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7 |
| English WinoGrande (Acc.) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
| English RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1 |
| English RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
| English TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
| English NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
| English AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
| Code HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| Code MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
| Code LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4 |
| Code CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
| Code CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
| Math GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3 |
| Math MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
| Math MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
| Math CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
| Chinese CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| Chinese C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1 |
| Chinese CMMLU (Acc.) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
| Chinese CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
| Chinese C3 (Acc.) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| Chinese CCPM (Acc.) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| Multilingual MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

| Benchmark (Metric) | DeepSeek V2-0506 | DeepSeek V2.5-0905 | Qwen2.5 72B-Inst. | Llama3.1 405B-Inst. | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 |
|---|---|---|---|---|---|---|---|
| Architecture (-) | MoE | MoE | Dense | Dense | - | - | MoE |
| # Activated Params (-) | 21B | 21B | 72B | 405B | - | - | 37B |
| # Total Params (-) | 236B | 236B | 72B | 405B | - | - | 671B |
| English MMLU (EM) | 78.2 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
| English MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 | 89.1 |
| English MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9 |
| English DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6 |
| English IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 | 86.1 |
| English GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 | 59.1 |
| English SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 | 24.9 |
| English FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 | 73.3 |
| English LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | 48.7 |
| Code HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6 |
| Code LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | 40.5 |
| Code LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | 37.6 |
| Code Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 |
| Code SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 |
| Code Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 | 79.7 |
| Code Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | 49.6 |
| Math AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| Math MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
| Math CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | 43.2 |
| Chinese CLUEWSC (EM) | 89.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 | 90.9 |
| Chinese C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | 86.5 |
| Chinese C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | 64.8 |

| Model | Arena-Hard | AlpacaEval 2.0 |
|---|---|---|
| DeepSeek-V2.5-0905 | 76.2 | 50.5 |
| Qwen2.5-72B-Instruct | 81.2 | 49.1 |
| LLaMA-3.1 405B | 69.3 | 40.5 |
| GPT-4o-0513 | 80.4 | 51.1 |
| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 |
| DeepSeek-V3 | 85.5 | 70.0 |