Cost and Efficiency of DeepSeek

DeepSeek has been releasing models for over a year. DeepSeek V1 was a 67B dense model trained on 2 trillion tokens, comparable in size and quality to Meta's Llama 1 and 2 models, which were trained for similar amounts of time. DeepSeek V2 (released in May 2024) was an MoE model with 21B active and 236B total parameters, trained on 8 trillion tokens. It was notably weaker than Llama 3 (70B), which is expected since V2 used far less compute. DeepSeek V3 is a larger model trained on 14.8T tokens, with 37B active and 671B total parameters.
The cost of training a model scales linearly with the number of active parameters and the number of tokens. For reference, Llama 1 (65B) required 1 million A100 GPU hours to train on 1.4 trillion tokens at about 45% MFU. DeepSeek V3 has roughly half as many activated parameters as that Llama model and was trained in FP8 instead of BF16. Switching to FP8 roughly halves training cost, at the risk of making the model more brittle.
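As a sanity check on these figures, here is a minimal sketch of the scaling arithmetic, using the common ~6 × parameters × tokens approximation for training FLOPs. The `training_gpu_hours` helper and the peak-TFLOPS inputs are illustrative assumptions, not DeepSeek's or Meta's own accounting.

```python
# Rough sketch of the scaling arithmetic, assuming the common
# FLOPs ~ 6 * params * tokens approximation. Peak-TFLOPS values and the
# helper name are illustrative.

def training_gpu_hours(active_params: float, tokens: float,
                       peak_tflops: float, mfu: float) -> float:
    """Estimate GPU hours for one training run."""
    total_flops = 6.0 * active_params * tokens          # parameter math only
    flops_per_gpu_second = peak_tflops * 1e12 * mfu     # sustained throughput
    return total_flops / flops_per_gpu_second / 3600.0

# Llama-1-style run: 65B params, 1.4T tokens, 45% MFU on A100s (312 TFLOPS BF16).
print(f"{training_gpu_hours(65e9, 1.4e12, 312, 0.45):,.0f} GPU hours")
```

With those inputs the estimate lands close to the roughly 1 million A100 GPU hours cited above.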
DeepSeek reports 300k GPU hours per trillion tokens to train the 67B V1 model on H800s. An H800 is about 2.4 times faster than an A100 (755 vs 312 TFLOPS), so DeepSeek's throughput works out to about 70% of Llama's reported MFU, or roughly 31%. On dense models, DeepSeek is therefore less than half as efficient as Ceramic's stack.
DeepSeek reportedly spent $5.6M on compute, using 2.8M hours of H800s. They report that training 1T tokens took 173K GPU hours for a model with 21B active parameters, implying their MoE models run at about 28% MFU. This is close to the performance that OpenAI is rumored to have achieved training GPT-4 on A100s back in the day. For DeepSeek's latest release, V3, training lasted twice as long and used twice as many tokens as V2. However, with FP8 precision and more parameters, we estimate that V3 achieved 88% of V2's efficiency, bringing its MFU down to about 23%.
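Inverting the same approximation gives implied-MFU estimates of this kind. Again a sketch: `implied_mfu` is an illustrative helper, and the result depends on which peak-TFLOPS figure is assumed for the H800 and on how FLOPs per token are counted.

```python
# Back-of-the-envelope MFU implied by a reported GPU-hour figure, using the
# same 6 * active_params * tokens approximation as above. Treat the output as
# a ballpark, not a precise reconstruction of DeepSeek's numbers.

def implied_mfu(active_params: float, tokens: float,
                gpu_hours: float, peak_tflops: float) -> float:
    useful_flops = 6.0 * active_params * tokens
    available_flops = gpu_hours * 3600.0 * peak_tflops * 1e12
    return useful_flops / available_flops

# V2-style inputs from above: 21B active parameters, 1T tokens, 173K GPU hours.
print(f"{implied_mfu(21e9, 1e12, 173_000, peak_tflops=755):.0%}")
```

With a 755 TFLOPS peak this comes out in the high 20s percent, in the same ballpark as the ~28% figure above; the exact value shifts with the peak and FLOPs-accounting assumptions.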
With the 4K context used by DeepSeek, Ceramic can achieve an MFU 2-3x higher, reaching 60% on H100s (or close to 80% on A100s). Ceramic does have access to more advanced chips, but even accounting for the difference between H100s and H800s, we see 2-3x the training performance. Furthermore, Ceramic can train with long contexts at essentially the same speed: while DeepSeek pre-trained with only a 4k context, Ceramic's stack can efficiently train a 70B model at 64k context with 3x DeepSeek's efficiency, at 72% MFU.
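Part of why long-context MFU is hard to hold is that per-token attention work grows with sequence length. The sketch below uses the common 6N + 12 · layers · d_model · seq_len per-token approximation with an illustrative 70B-style shape (80 layers, d_model 8192); both the shape and the accounting are assumptions for illustration.

```python
# Per-token training FLOPs vs context length, using the common approximation
# of 6*N for the parameter math plus ~12 * n_layers * d_model * seq_len for
# attention. The 70B-style shape below is illustrative.

def flops_per_token(n_params: float, n_layers: int, d_model: int, seq_len: int) -> float:
    return 6.0 * n_params + 12.0 * n_layers * d_model * seq_len

for ctx in (4_096, 65_536):
    tf = flops_per_token(70e9, 80, 8192, ctx) / 1e12
    print(f"{ctx:>6}-token context: ~{tf:.2f} TFLOPs per token")
```

Under this accounting, the attention term roughly doubles the per-token work at 64k context, which is why sustaining high MFU there is a much taller order than at 4k.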
What shocked the market wasn't the performance of the base model. Few paid attention to DeepSeek V1 or V2, even with V2's introduction of latent attention and its MoE architecture. The real game changer was the addition of o1-style reasoning, a post-training step that teaches models to generate better outputs. In this process, the model produces multiple answers to a set of questions, the answers are scored, and the model is iteratively updated to favor the higher-scoring ones. DeepSeek invested only around $100k in reinforcement learning (RL) for R1, and that step is what made the difference.
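A minimal toy version of that sample-score-update loop is sketched below: a tabular "policy" over a handful of candidate answers stands in for the LLM, answers are scored by exact match, and a REINFORCE-style update with a group-average baseline pushes probability toward higher-scoring answers. This illustrates the idea only; the dataset, hyperparameters, and helper names are made up and this is not DeepSeek's actual training recipe.

```python
import numpy as np

# Toy sample-score-update loop: sample several answers per question, score
# them against a known solution, and nudge the policy toward answers that
# score above the group average (REINFORCE with a group-relative baseline).

rng = np.random.default_rng(0)

problems = [
    {"question": "2 + 2", "candidates": ["3", "4", "5"], "answer": "4"},
    {"question": "7 * 6", "candidates": ["42", "36", "48"], "answer": "42"},
]

# One logit vector per problem over its candidate answers (the toy "policy").
logits = {p["question"]: np.zeros(len(p["candidates"])) for p in problems}

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(candidate: str, answer: str) -> float:
    # Rule-based, verifiable scoring: exact match with the reference answer.
    return 1.0 if candidate == answer else 0.0

K, LR = 8, 0.5
for step in range(20):
    for p in problems:
        probs = softmax(logits[p["question"]])
        # Sample K answers per question and score each one.
        idx = rng.choice(len(probs), size=K, p=probs)
        rewards = np.array([reward(p["candidates"][i], p["answer"]) for i in idx])
        advantages = rewards - rewards.mean()          # group-relative baseline
        # REINFORCE-style update: push probability toward high-advantage samples.
        grad = np.zeros_like(probs)
        for i, adv in zip(idx, advantages):
            grad += adv * (np.eye(len(probs))[i] - probs)   # adv * grad log prob
        logits[p["question"]] += LR * grad / K

for p in problems:
    print(p["question"], softmax(logits[p["question"]]).round(2))
```

After a few iterations the probability mass shifts onto the correct candidates, which is the same mechanism, at toy scale, that the RL stage uses to improve reasoning outputs.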
They spent only $100k on RL because they were limited by the number of high-quality question-answer pairs available. Unlike models that rely on a value model to evaluate responses, DeepSeek R1 required problems with clear, objective solutions. Since they trained primarily on math problems, R1 excels at math, but its skills do not transfer well to other domains. Expanding beyond math would require high-quality question-answer pairs in other fields or, more realistically, a way to score answers in those domains. In the legal domain, for example, evaluating the quality of the reasoning is far more important than simply determining whether an argument concludes with "guilty" or "not guilty."
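To make the "clear, objective solutions" point concrete, here is a minimal sketch of a rule-based verifier of the kind that works for math: extract a final numeric answer and compare it with the reference. The `Answer:` extraction convention and the `math_reward` helper are assumptions for illustration; there is no comparable exact check for grading the quality of a legal argument.

```python
import re

# Rule-based, verifiable reward: reduce the completion to a number and compare
# it with the reference answer. The "Answer: ..." convention is an assumption.

def math_reward(completion: str, reference: str) -> float:
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    try:
        return 1.0 if abs(float(match.group(1)) - float(reference)) < 1e-6 else 0.0
    except ValueError:
        return 0.0

print(math_reward("The product is 42.\nAnswer: 42", "42"))    # 1.0
print(math_reward("I think it's about 40.\nAnswer: 40", "42"))  # 0.0
```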
Due to limitations in text data, pre-training struggles to build models significantly stronger than 70B dense models. However, if these models can handle very long contexts, they can reason more effectively. This allows them to be trained to reason in any specific domain, as long as there is post-training data to help them improve. Post-training data is harder to obtain than pre-training data, which can be collected simply by crawling the web. Post-training requires high-quality, on-domain problems and, more importantly, a methodology to score and correct the generated answers. Ceramic has developed a large set of novel ways to evaluate the correctness of answers, along with a very high-performance stack that enables new approaches to RL.