Model Library

Browse and deploy state-of-the-art AI models through the DevUp Gateway.

Model Information

Introduction

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire run, we did not experience any irrecoverable loss spikes or perform any rollbacks.

Model Summary

Architecture: Innovative Load Balancing Strategy and Training Objective

On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.
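
This section does not spell out how the auxiliary-loss-free strategy works. The sketch below assumes a bias-based scheme: a per-expert bias is added to the routing scores only when choosing experts, and after each step the bias is nudged down for overloaded experts and up for underloaded ones, so no balancing term is added to the loss. The function names, the toy score distribution, and the step size are all illustrative assumptions, not the authors' implementation.

```python
import random

def route_tokens(scores, bias, top_k):
    """Select top_k experts per token by biased score; gate with raw scores."""
    assignments = []
    for token_scores in scores:
        # The bias affects only which experts are selected...
        ranked = sorted(
            range(len(token_scores)),
            key=lambda e: token_scores[e] + bias[e],
            reverse=True,
        )
        chosen = ranked[:top_k]
        # ...while gating weights are normalized from the unbiased scores.
        total = sum(token_scores[e] for e in chosen)
        assignments.append({e: token_scores[e] / total for e in chosen})
    return assignments

def update_bias(bias, assignments, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = [0] * len(bias)
    for gates in assignments:
        for expert in gates:
            load[expert] += 1
    mean_load = sum(load) / len(load)
    new_bias = [
        b - step_size if n > mean_load else b + step_size if n < mean_load else b
        for b, n in zip(bias, load)
    ]
    return new_bias, load

# Toy run (hypothetical sizes): expert 0 gets systematically higher scores,
# so without correction it would receive most of the tokens.
rng = random.Random(0)
NUM_EXPERTS, TOP_K, TOKENS, STEPS = 8, 2, 256, 200
bias = [0.0] * NUM_EXPERTS
loads = []
for _ in range(STEPS):
    scores = [
        [rng.random() + (0.5 if e == 0 else 0.0) for e in range(NUM_EXPERTS)]
        for _ in range(TOKENS)
    ]
    bias, load = update_bias(bias, route_tokens(scores, bias, TOP_K), step_size=0.01)
    loads.append(load)

first_load, final_load = loads[0], loads[-1]
print("first step load:", first_load)
print("final step load:", final_load)
```

In this toy setup the favored expert starts heavily overloaded and its bias drifts negative until the per-expert loads are roughly even, while the gating weights themselves never see the bias.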

Pre-Training: Towards Ultimate Training Efficiency

We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap.
This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The training stages after pre-training require only about 0.1M additional GPU hours.
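
The figures quoted here and in the introduction can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
PRETRAIN_GPU_HOURS = 2.664e6  # H800 GPU hours for pre-training (this section)
TOTAL_GPU_HOURS = 2.788e6     # H800 GPU hours for full training (introduction)
PRETRAIN_TOKENS = 14.8e12     # tokens seen during pre-training

# GPU hours spent on everything after pre-training.
post_pretrain_hours = TOTAL_GPU_HOURS - PRETRAIN_GPU_HOURS

# Pre-training cost per trillion tokens, and throughput per GPU hour.
hours_per_trillion_tokens = PRETRAIN_GPU_HOURS / (PRETRAIN_TOKENS / 1e12)
tokens_per_gpu_hour = PRETRAIN_TOKENS / PRETRAIN_GPU_HOURS

print(f"post-pre-training GPU hours: {post_pretrain_hours:,.0f}")         # 124,000
print(f"GPU hours per trillion tokens: {hours_per_trillion_tokens:,.0f}")  # 180,000
print(f"tokens per GPU hour: {tokens_per_gpu_hour:,.0f}")                  # 5,555,556
```

The difference between the two totals, roughly 0.12M GPU hours, is what the text rounds to 0.1M.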

Post-Training: Knowledge Distillation from DeepSeek-R1

We introduce an innovative methodology to distill reasoning capabilities from a long Chain-of-Thought (CoT) model, specifically one of the DeepSeek-R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

Model Downloads

Model	Total Params	Activated Params	Context Length	Download
DeepSeek-V3-Base	671B	37B	128K	🤗 HuggingFace
DeepSeek-V3	671B	37B	128K	🤗 HuggingFace

NOTE: The total size of DeepSeek-V3 models on HuggingFace is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights.
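
As a rough guide to what the 685B figure means for storage, the sketch below converts parameter counts into checkpoint sizes. The bytes-per-parameter values are assumptions for two common storage formats (this section does not state which format the published weights use), and the estimate ignores file metadata.

```python
MAIN_MODEL_PARAMS = 671e9  # Main Model weights (from the note above)
MTP_MODULE_PARAMS = 14e9   # Multi-Token Prediction Module weights

# Assumed storage cost per parameter; not stated in this section.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}

def checkpoint_size_gb(num_params, dtype):
    """Approximate checkpoint size in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

total_params = MAIN_MODEL_PARAMS + MTP_MODULE_PARAMS
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {checkpoint_size_gb(total_params, dtype):,.0f} GB")
```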

To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: How to Run Locally.

For developers looking to dive deeper, we recommend exploring README_WEIGHTS.md for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback.

Evaluation Results

Base Model - Standard Benchmarks

Category	Benchmark (Metric)	# Shots	DeepSeek-V2	Qwen2.5 72B	LLaMA3.1 405B	DeepSeek-V3
-	Architecture (-)	-	MoE	Dense	Dense	MoE
-	# Activated Params (-)	-	21B	72B	405B	37B
-	# Total Params (-)	-	236B	72B	405B	671B
English	Pile-test (BPB)	-	0.606	0.638	0.542	0.548
English	BBH (EM)	3-shot	78.8	79.8	82.9	87.5
English	MMLU (Acc.)	5-shot	78.4	85.0	84.4	87.1
English	MMLU-Redux (Acc.)	5-shot	75.6	83.2	81.3	86.2
English	MMLU-Pro (Acc.)	5-shot	51.4	58.3	52.8	64.4
English	DROP (F1)	3-shot	80.4	80.6	86.0	89.0
English	ARC-Easy (Acc.)	25-shot	97.6	98.4	98.4	98.9
English	ARC-Challenge (Acc.)	25-shot	92.2	94.5	95.3	95.3
English	HellaSwag (Acc.)	10-shot	87.1	84.8	89.2	88.9
English	PIQA (Acc.)	0-shot	83.9	82.6	85.9	84.7
English	WinoGrande (Acc.)	5-shot	86.3	82.3	85.2	84.9
English	RACE-Middle (Acc.)	5-shot	73.1	68.1	74.2	67.1
English	RACE-High (Acc.)	5-shot	52.6	50.3	56.8	51.3
English	TriviaQA (EM)	5-shot	80.0	71.9	82.7	82.9
English	NaturalQuestions (EM)	5-shot	38.6	33.2	41.5	40.0
English	AGIEval (Acc.)	0-shot	57.5	75.8	60.6	79.6
Code	HumanEval (Pass@1)	0-shot	43.3	53.0	54.9	65.2
Code	MBPP (Pass@1)	3-shot	65.0	72.6	68.4	75.4
Code	LiveCodeBench-Base (Pass@1)	3-shot	11.6	12.9	15.5	19.4
Code	CRUXEval-I (Acc.)	2-shot	52.5	59.1	58.5	67.3
Code	CRUXEval-O (Acc.)	2-shot	49.8	59.9	59.9	69.8
Math	GSM8K (EM)	8-shot	81.6	88.3	83.5	89.3
Math	MATH (EM)	4-shot	43.4	54.4	49.0	61.6
Math	MGSM (EM)	8-shot	63.6	76.2	69.9	79.8
Math	CMath (EM)	3-shot	78.7	84.5	77.3	90.7
Chinese	CLUEWSC (EM)	5-shot	82.0	82.5	83.0	82.7
Chinese	C-Eval (Acc.)	5-shot	81.4	89.2	72.5	90.1
Chinese	CMMLU (Acc.)	5-shot	84.0	89.5	73.7	88.8
Chinese	CMRC (EM)	1-shot	77.4	75.8	76.0	76.3
Chinese	C3 (Acc.)	0-shot	77.4	76.7	79.7	78.6
Chinese	CCPM (Acc.)	0-shot	93.0	88.5	78.6	92.0
Multilingual	MMMLU-non-English (Acc.)	5-shot	64.0	74.8	73.8	79.4

Note: Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks. For more evaluation details, please check our paper.
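
The "same level" rule in the note can be applied mechanically to any row of the table. The sketch below is one way to do so for higher-is-better metrics on a percentage scale; the function name and the small floating-point epsilon are illustrative choices, and the sample rows are copied from the table above.

```python
def same_level_as_best(scores, tolerance=0.3):
    """Return the models whose score is within `tolerance` of the row's best.

    Assumes a higher-is-better metric on a 0-100 scale (not BPB).
    """
    best = max(scores.values())
    # The epsilon guards against floating-point noise at exactly 0.3.
    return sorted(m for m, s in scores.items() if best - s <= tolerance + 1e-9)

hellaswag = {"DeepSeek-V2": 87.1, "Qwen2.5 72B": 84.8,
             "LLaMA3.1 405B": 89.2, "DeepSeek-V3": 88.9}
mmlu = {"DeepSeek-V2": 78.4, "Qwen2.5 72B": 85.0,
        "LLaMA3.1 405B": 84.4, "DeepSeek-V3": 87.1}

print(same_level_as_best(hellaswag))  # ['DeepSeek-V3', 'LLaMA3.1 405B']
print(same_level_as_best(mmlu))       # ['DeepSeek-V3']
```

On HellaSwag the 0.3-point gap between LLaMA3.1 405B and DeepSeek-V3 puts both at the same level, whereas on MMLU DeepSeek-V3 stands alone.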

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.
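
For readers unfamiliar with the setup, a Needle In A Haystack test hides a short fact (the needle) at a chosen depth inside long filler text and asks the model to retrieve it, sweeping over context lengths and depths. The sketch below only builds such prompts. It measures length in words rather than tokens, and the filler text, needle, and grid values are hypothetical; it is not the harness used for the results reported here.

```python
def build_niah_prompt(filler_sentence, needle, context_words, depth):
    """Build a haystack of at least `context_words` words with `needle` inserted.

    `depth` is a fraction in [0.0, 1.0]: 0.0 places the needle at the start,
    1.0 at the end.
    """
    words_per_sentence = len(filler_sentence.split())
    repeats = context_words // words_per_sentence + 1
    sentences = [filler_sentence] * repeats
    sentences.insert(round(depth * len(sentences)), needle)
    return " ".join(sentences)

# Hypothetical filler, needle, and sweep grid.
FILLER = "The sky was clear that day."
NEEDLE = "The secret passphrase is blue-harbor-42."
grid = [(length, depth)
        for length in (1_000, 4_000, 16_000)
        for depth in (0.0, 0.5, 1.0)]

prompts = {cell: build_niah_prompt(FILLER, NEEDLE, *cell) for cell in grid}
print(len(prompts), "prompts built")
```

Each prompt would then be followed by a question such as "What is the secret passphrase?", and the answer scored by whether it contains the needle's fact.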

Chat Model - Standard Benchmarks

Category	Benchmark (Metric)	DeepSeek V2-0506	DeepSeek V2.5-0905	Qwen2.5 72B-Inst.	Llama3.1 405B-Inst.	Claude-3.5-Sonnet-1022	GPT-4o 0513	DeepSeek V3
-	Architecture (-)	MoE	MoE	Dense	Dense	-	-	MoE
-	# Activated Params (-)	21B	21B	72B	405B	-	-	37B
-	# Total Params (-)	236B	236B	72B	405B	-	-	671B
English	MMLU (EM)	78.2	80.6	85.3	88.6	88.3	87.2	88.5
English	MMLU-Redux (EM)	77.9	80.3	85.6	86.2	88.9	88.0	89.1
English	MMLU-Pro (EM)	58.5	66.2	71.6	73.3	78.0	72.6	75.9
English	DROP (3-shot F1)	83.0	87.8	76.7	88.7	88.3	83.7	91.6
English	IF-Eval (Prompt Strict)	57.7	80.6	84.1	86.0	86.5	84.3	86.1
English	GPQA-Diamond (Pass@1)	35.3	41.3	49.0	51.1	65.0	49.9	59.1
English	SimpleQA (Correct)	9.0	10.2	9.1	17.1	28.4	38.2	24.9
English	FRAMES (Acc.)	66.9	65.4	69.8	70.0	72.5	80.5	73.3
English	LongBench v2 (Acc.)	31.6	35.4	39.4	36.1	41.0	48.1	48.7
Code	HumanEval-Mul (Pass@1)	69.3	77.4	77.3	77.2	81.7	80.5	82.6
Code	LiveCodeBench (Pass@1-COT)	18.8	29.2	31.1	28.4	36.3	33.4	40.5
Code	LiveCodeBench (Pass@1)	20.3	28.4	28.7	30.1	32.8	34.2	37.6
Code	Codeforces (Percentile)	17.5	35.6	24.8	25.3	20.3	23.6	51.6
Code	SWE Verified (Resolved)	-	22.6	23.8	24.5	50.8	38.8	42.0
Code	Aider-Edit (Acc.)	60.3	71.6	65.4	63.9	84.2	72.9	79.7
Code	Aider-Polyglot (Acc.)	-	18.2	7.6	5.8	45.3	16.0	49.6
Math	AIME 2024 (Pass@1)	4.6	16.7	23.3	23.3	16.0	9.3	39.2
Math	MATH-500 (EM)	56.3	74.7	80.0	73.8	78.3	74.6	90.2
Math	CNMO 2024 (Pass@1)	2.8	10.8	15.9	6.8	13.1	10.8	43.2
Chinese	CLUEWSC (EM)	89.9	90.4	91.4	84.7	85.4	87.9	90.9
Chinese	C-Eval (EM)	78.6	79.5	86.1	61.5	76.7	76.0	86.5
Chinese	C-SimpleQA (Correct)	48.5	54.1	48.4	50.4	51.3	59.3	64.8

Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models.
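
The evaluation protocol in the note can be sketched as follows. Only the 1000-sample threshold and the 8K output limit come from the text; the specific temperature values, the use of a single greedy run for larger benchmarks, averaging as the aggregation rule, and the `score_fn` callback signature are all assumptions made for illustration.

```python
from statistics import mean

SMALL_BENCHMARK_THRESHOLD = 1000  # from the note above
MAX_OUTPUT_TOKENS = 8192          # "8K", interpreted here as 8192 tokens
TEMPERATURES = (0.2, 0.6, 1.0)    # hypothetical sweep for small benchmarks

def evaluate(samples, score_fn):
    """Return a benchmark score on a 0-100 scale.

    `score_fn(sample, temperature, max_tokens)` is a hypothetical callback
    that runs the model on one sample and returns a score in [0, 1].
    """
    if len(samples) < SMALL_BENCHMARK_THRESHOLD:
        temperatures = TEMPERATURES  # repeat small benchmarks to reduce variance
    else:
        temperatures = (0.0,)        # assumed single greedy run
    run_scores = [
        mean(score_fn(s, t, MAX_OUTPUT_TOKENS) for s in samples)
        for t in temperatures
    ]
    return 100 * mean(run_scores)
```

Averaging over several runs matters most for small sets such as competition-style math benchmarks, where a single sample flipping changes the score by several points.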

Open Ended Generation Evaluation

Model	Arena-Hard	AlpacaEval 2.0
DeepSeek-V2.5-0905	76.2	50.5
Qwen2.5-72B-Instruct	81.2	49.1
LLaMA-3.1 405B	69.3	40.5
GPT-4o-0513	80.4	51.1
Claude-Sonnet-3.5-1022	85.2	52.0
DeepSeek-V3	85.5	70.0

Note: English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.