Three Mistakes Meta Made with Llama 4
Meta has just released Llama 4, though their largest model is still unreleased. The technical report is not out either, so some of this is speculative, but if you can’t critique large companies based on incomplete and shaky data, then what is the point of the Internet?
TL;DR
Meta built MoEs (and not even fine-grained ones) rather than dense models, and MoEs are slower to serve. They did not use any of the fancy new attention variants, so their KV caches will be too big for reasoning workloads, limiting efficiency. And they used only 32k of their 100k GPUs, wasting maybe $2B in hardware, perhaps because they didn’t use context and tensor parallelism to scale.
MoE - slow to serve
The first big decision Llama 4 made was to use an MoE. MoEs have an advantage over dense models: they are roughly 4x faster to train to a given performance level, and fine-grained MoEs are better still. A dozen researchers in Warsaw, whom I’d love to meet one day, published “Scaling Laws for Fine-Grained Mixture of Experts” in February 2024, showing that MoEs beat dense models and that fine-grained MoEs beat ordinary ones. DeepSeek V3 was the first large model to take advantage of this.
However, Llama 4 is not fine-grained, so the models are only perhaps 4 times more efficient to train. A factor of four is nothing to sneeze at, but, as with all good things in life, it comes with a cost: MoEs are more expensive to serve.
If you have a 17B-active-parameter MoE, serving a token requires pulling 17B parameters from memory. Serving a 70B dense model requires pulling 70B parameters, so while their performance is similar, the MoE seems to require less memory access. But consider a single query. A dense model will use speculative decoding and manage to infer around 4 tokens per weight pass. An MoE that tries to use speculative decoding will find that the drafted tokens route to different experts, so it pulls essentially all of its parameters into memory for each attempt (which can be 16 or 128 times the active count). Thus, for a single query, an MoE is only about as efficient as a dense model four times its active size. Ouch.
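To put rough numbers on this, here is a back-of-the-envelope sketch. The bf16 weights, the 4-token draft, the 109B-total / 17B-active MoE configuration, and the assumption that a 4-token draft touches essentially every expert are my own placeholders, not anything Meta has published:

```python
# Back-of-the-envelope memory traffic per generated token (decoding is memory-bound,
# so bytes read per token is what matters).
BYTES = 2          # bf16 bytes per parameter (assumption)
DRAFT = 4          # tokens verified per speculative-decoding step (assumption)

dense_70b_specdec = 70e9 * BYTES / DRAFT   # one weight pass verifies 4 tokens
moe_17b_plain     = 17e9 * BYTES           # 17B active parameters, one token at a time
moe_109b_specdec  = 109e9 * BYTES / DRAFT  # assumed 17B-active / 109B-total MoE whose
                                           # 4 drafted tokens touch (nearly) all experts
for name, b in [("dense 70B + spec-dec", dense_70b_specdec),
                ("17B-active MoE, no spec-dec", moe_17b_plain),
                ("17B-active MoE + spec-dec", moe_109b_specdec)]:
    print(f"{name}: {b / 1e9:.0f} GB pulled per generated token")
```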
What about batching? Once we batch, the entire MoE model is pulled into memory each time, so on a single GPU the dense model will still be faster: if it has n times fewer total (not active) parameters, it will be n times faster for small batches. If we split across multiple GPUs, so that separate experts sit on separate GPUs, we can find a point where the MoE has an advantage. If the number of queries is large enough that each GPU is compute-limited (not memory-bound), then the MoE will be four times faster than the equivalent dense model (the one with four times its active parameters).
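Where is that crossover? A rough estimate, assuming approximate published H100 SXM numbers (about 990 TFLOP/s of dense bf16 tensor-core compute and about 3.35 TB/s of HBM bandwidth):

```python
# Rough batch size at which a decode step stops being memory-bound on one GPU.
PEAK_FLOPS = 990e12      # ~H100 SXM dense bf16 tensor-core throughput (approximate)
MEM_BW     = 3.35e12     # ~H100 SXM HBM bandwidth in bytes/s (approximate)
BYTES      = 2           # bf16 weights

# For a weight matrix of P parameters and a batch of B tokens:
#   compute time ~= 2 * P * B / PEAK_FLOPS
#   memory  time ~= P * BYTES / MEM_BW        (weights read once for the whole batch)
# The step becomes compute-bound once compute time exceeds memory time, i.e. when
crossover_batch = PEAK_FLOPS * BYTES / (2 * MEM_BW)
print(f"compute-bound above roughly {crossover_batch:.0f} tokens per GPU")
```

So, very roughly, you need a few hundred tokens in flight per GPU before the weights stop being the bottleneck.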
KV cache limits batch size
Does this happen in practice? Alas, no. The kinds of queries we care about have long contexts. We want to do reasoning, so our queries are probably several thousand (or tens of thousands of) tokens long. In this case, the KV cache begins to matter: as we increase the batch size, the KV cache grows and becomes the limiting factor. DeepSeek introduced latent attention to control their cache size, and Mistral, and later Gemma, used sliding-window attention for the same reason. Gemma notably uses regular (global) attention only every fifth layer, reducing the KV cache by up to a factor of five for long contexts. Sadly, Meta did not do this. That is their second error: the KV cache limits the number of concurrent queries, and even on H200s we can’t reach the point where we process enough queries in parallel to make MoEs worth it.
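To see how quickly the KV cache eats an H200, here is a hedged sketch; the layer, head, and head-dimension counts are placeholders of mine, not Llama 4’s actual configuration:

```python
# How the KV cache scales with context length, and what Gemma-style attention saves.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 48, 8, 128, 2   # placeholder config, bf16 K and V

def kv_bytes(context, layers, window=None):
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES   # K and V
    cached_tokens = context if window is None else min(context, window)
    return layers * per_token_per_layer * cached_tokens

ctx = 32_000                                   # a "reasoning"-length query
full_global = kv_bytes(ctx, LAYERS)            # global attention in every layer
gemma_style = (kv_bytes(ctx, LAYERS // 5)                           # global every 5th layer
               + kv_bytes(ctx, LAYERS - LAYERS // 5, window=1024))  # 1k sliding window elsewhere
print(f"global attention everywhere: {full_global / 1e9:.1f} GB of KV per query")
print(f"global every 5th layer only: {gemma_style / 1e9:.1f} GB of KV per query")
# The first configuration leaves room for far fewer concurrent queries on a 141 GB H200.
```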
So, choosing MoEs was a mistake, especially as they did not get the additional advantage that fine-grainedness could have brought. Was there a third issue?
Expert Parallelism is not enough
Meta bought 128k (I refuse to believe that anyone would buy GPUs in anything but a power of 2) H100 GPUs and connected them in a single cluster. Each group of 8 GPUs is connected by 900GB/s NVLink, and all the GPUs are connected by 50GB/s Infiniband. This cluster has just the right architecture and size to train a large dense LLM.
Llama 3 405B was trained on a 16k-GPU cluster with a 16-deep, 4x-interleaved pipeline and 8x tensor parallelism. Each pipeline thus had 128 GPUs, so there were 128 pipelines. A context length of 8k meant that each machine processed 8k tokens at a time, and 4x interleaving allowed the pipeline to reach 80% efficiency with 16 microbatches. This all works out to a global batch size of 16M tokens, which is a little high (2M to 4M would have been much better, but I will write about that another time) but respectable.
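The arithmetic, spelled out (nothing here beyond the numbers already quoted above):

```python
# The Llama 3 405B layout arithmetic from the paragraph above.
GPUS            = 16_384
PIPELINE_DEPTH  = 16        # pipeline stages
TENSOR_PARALLEL = 8
CONTEXT         = 8_192     # tokens processed per machine
MICROBATCHES    = 16        # 4x interleaving -> ~80% pipeline efficiency

gpus_per_pipeline = PIPELINE_DEPTH * TENSOR_PARALLEL       # 128
num_pipelines     = GPUS // gpus_per_pipeline              # 128 data-parallel pipelines
global_batch      = num_pipelines * MICROBATCHES * CONTEXT
print(f"{gpus_per_pipeline} GPUs per pipeline, {num_pipelines} pipelines, "
      f"global batch of {global_batch / 2**20:.0f}M tokens")
```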
How did Meta plan to scale to 128k GPUs? For a dense model, we can scale up by a factor of 8 by adding 8x context parallelism. This reduces the number of tokens per machine to 1k and thus allows for a 128k-GPU cluster with a batch size of 16M. Reducing to a 1k context does take a hit on MFU, and we get the additional cost of shipping the KVs around: each GPU now needs to send 1k tokens’ worth of KV to 7 other GPUs. For the 405B model and a context of 1k, this is 128*7*1k*2 = 1.8MB, which takes less than 0.1ms over the Infiniband network. Each layer takes about a millisecond of compute (2*10*s*d^2/8 ops), so this adds about a 10% overhead, which is acceptable.
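A sketch of that trade-off, reusing the post’s numbers (d = 16k for the 405B model, ~10·d² parameters per layer, 8x tensor parallelism, 50GB/s Infiniband per GPU); the 50% MFU figure is my assumption:

```python
# The context-parallel KV shuffle versus the per-layer compute, in numbers.
D_MODEL  = 16_384        # Llama 3 405B hidden size
TOKENS   = 1_024         # context left per machine after 8x context parallelism
CP_PEERS = 7             # each GPU ships its KV slice to the 7 other context-parallel ranks
KV_BYTES = 128 * 2       # per-token KV footprint per GPU, taking the post's figure at face value
IB_BW    = 50e9          # bytes/s of Infiniband per GPU
TP       = 8

kv_traffic  = KV_BYTES * CP_PEERS * TOKENS          # ~1.8 MB
kv_time     = kv_traffic / IB_BW
layer_flops = 2 * 10 * TOKENS * D_MODEL**2 / TP     # per-GPU flops for one layer
layer_time  = layer_flops / (990e12 * 0.5)          # bf16 at an assumed 50% MFU
print(f"KV shuffle: {kv_traffic / 1e6:.1f} MB, {kv_time * 1e3:.2f} ms")
print(f"layer compute: {layer_flops / 1e12:.2f} TFLOP, about {layer_time * 1e3:.1f} ms")
```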
The blog announcement of Llama 4 says they used 32k GPUs. Why did they fail to use the full cluster? MoEs are harder to train than dense models. Tensor parallelism has the great property that it keeps the activations large; context parallelism, in contrast, shrinks them linearly. Expert parallelism has the same problem: it also shrinks the activations. When we use a 1k context, 16x expert parallelism reduces the activation to 64 tokens, which is far below the arithmetic intensity GPUs need. Because of this, we would need to combine the activations of many contexts before applying each expert, and that combination costs networking.
I would guess, following DeepSeek, that Llama 4 was trained with 16x expert parallelism and little or no tensor parallelism. Let’s first consider how DeepSeek would do this.
DeepSeek used 8 machines for each stage in their pipeline, and each machine processed 4k tokens through the attention layer, splitting by attention head. This gives 32k of context across the 8 machines. If we split the tokens by GPU, as is natural after the reduce-scatter in attention, each GPU has 512 tokens of activation (this is called sequence parallelism). Naively, each GPU would send its 512 tokens to every other GPU of the same rank, and then all the GPUs would all2all their 4k of tokens with the other 7 GPUs over NVLink so that every GPU had all 32k. We could then decide which tokens to use for each expert. This would take two all2alls, so we could get by with plain NCCL calls.
However, we can do better by sending just the data that is needed. Each machine has 32 experts, so every machine probably needs most of the tokens (each expert needs 1/32 of the tokens, so 32 experts will require about 64% of the tokens on average). DeepSeek decided to send each token to only half the machines (the machine it is created on and three others), which halves the data we need to transfer. Each GPU therefore sends 256 tokens to each of the seven same-rank GPUs on the seven other machines, and the GPUs then use NVLink to forward just the required tokens to the other GPUs in their machine. The internal dimension for DeepSeek is 7k, so this is about 12MB of data. We also need to return twice this data (the return values are in bf16), but SHARP reduces it to the same 12MB since we are doing all-reduces. So we need to send about 24MB over the 50GB/s Infiniband, which takes about 1ms (Infiniband’s bandwidth is bidirectional). Luckily, the compute takes about the same amount of time: 2*10*1k*(7k)^2 ≈ 1 TFLOP, or roughly a millisecond in fp8.
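Reproducing that estimate (the 50% MFU and the ~2000 TFLOP/s fp8 peak are my assumptions; exactly how you count the bidirectional Infiniband bandwidth moves the communication time by about a factor of two, but it stays comparable to the compute):

```python
# The DeepSeek-style expert-parallel dispatch estimate, reproduced.
D_MODEL        = 7_168
TOKENS_PER_GPU = 512       # 4k per machine / 8 GPUs after sequence parallelism
PEERS          = 7         # same-rank GPUs on the 7 other machines
SENT_FRACTION  = 0.5       # each token is dispatched to only half the machines

sent_tokens  = TOKENS_PER_GPU * SENT_FRACTION * PEERS      # 256 tokens to each of 7 peers
dispatch     = sent_tokens * D_MODEL * 1                   # fp8 activations: ~12 MB out
combine      = dispatch                                    # bf16 return, halved again by SHARP
ib_time      = (dispatch + combine) / 50e9
expert_flops = 2 * 10 * 1_024 * D_MODEL**2                 # the post's per-layer estimate
expert_time  = expert_flops / (2_000e12 * 0.5)             # fp8 at an assumed 50% MFU
print(f"Infiniband: {(dispatch + combine) / 1e6:.0f} MB, ~{ib_time * 1e3:.1f} ms")
print(f"expert compute: {expert_flops / 1e12:.1f} TFLOP, ~{expert_time * 1e3:.1f} ms")
```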
Now, let’s consider the simplest way to support Llama 4. Since it is not fine-grained, this is even easier. We choose a 16-deep pipeline, 4x interleaved, so we need just 16 microbatches, and we assume fp8. We have 16 experts, so each pipeline layer gets two machines, or 16 GPUs, with each expert on its own GPU. We want to support 8k context, so we give 8k tokens to each machine. Each attention layer splits by head, and after the RMSNorm we have 1k tokens of activation per GPU. 1/16th of this, or 64 tokens, needs to be sent to each other GPU. Half of the tokens go to GPUs on the other machine, so we first send that half over Infiniband and then distribute the tokens to the right GPU using NVLink. Assuming a 7k internal dimension, each GPU needs to send and receive 7k*512 = 3.5MB, for 7MB total, which takes about 0.3ms. The compute for a layer is 2*10*s*d^2 ≈ 1 TFLOP, which at 50% MFU in fp8 is about 1ms.
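The same kind of estimate for this layout (again with my assumed 50GB/s per direction and 50% MFU at fp8, so the exact milliseconds come out a little different from the figures above, but the conclusion that compute dominates does not change):

```python
# The "simplest Llama 4 layout": 16 experts, one per GPU, two machines per stage.
D_MODEL        = 7_168         # assumed internal dimension
TOKENS_PER_GPU = 1_024         # 8k context per machine / 8 GPUs after attention + RMSNorm
CROSS_MACHINE  = TOKENS_PER_GPU // 2   # half the tokens route to experts on the other machine

ib_bytes    = CROSS_MACHINE * D_MODEL           # ~3.5 MB sent (fp8), and the same received
ib_time     = 2 * ib_bytes / 50e9
layer_flops = 2 * 10 * TOKENS_PER_GPU * D_MODEL**2
layer_time  = layer_flops / (2_000e12 * 0.5)    # fp8 at an assumed 50% MFU
print(f"Infiniband: {2 * ib_bytes / 1e6:.0f} MB, ~{ib_time * 1e3:.2f} ms")
print(f"layer compute: {layer_flops / 1e12:.1f} TFLOP, ~{layer_time * 1e3:.1f} ms")
```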
This needs 1k tokens for each GPU, so only at very large batch sizes like 32M can we use 32k GPUs. Llama 4’s blog mentions that they achieve 20% MFU on 32k GPUs, so this might be what they did. A batch size of 32M is terrible, and Zuck is probably mad at us for not using all the GPUs. Can we think of a way to do better?
We can. We use a combination of context parallelism and tensor parallelism. Context parallelism spreads the work across more GPUs, but makes the context (or number of tokens processed) smaller. Tensor parallelism also spreads the work, but maintains the number of tokens (as it splits the parameters instead).
We use 8x context parallelism, 8x tensor parallelism, 16x expert parallelism, and 16x pipeline parallelism. Each pipeline layer now has 16 machines, and we cut the number of tokens processed by each machine to 1k. This means that after the attention layer, each GPU has 128 tokens. Each GPU now sends 8 of these tokens (about 56KB) to each other GPU with the same rank, which amounts to 120 * 7k ≈ 840KB of data over the Infiniband network. Each GPU receives 120 tokens to go with the 8 of its own. The GPUs on a machine then all-gather over NVLink so that each has 1k tokens of activation. The expert processes this, generating 1k tokens of activation, which is reduce-scattered along the NVLink, and each GPU then reduce-scatters its 128 tokens of activation over the Infiniband.
We send about 1.6MB over Infiniband (0.06ms) and 7 times this over NVLink (0.03ms). Meanwhile, each layer takes 2*10*s*d^2/m flops, where m is the tensor parallelism degree. As s is now 1k, this is about 130 GFLOPs, so at 50% MFU in fp8 it takes 0.13ms. We spend more time on compute than on networking.
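And the estimate for the combined layout (same assumptions as before):

```python
# The combined 8x context-parallel / 8x tensor-parallel / 16x expert-parallel layout.
D_MODEL            = 7_168
TOKENS_PER_MACHINE = 1_024
TP = 8                        # GPUs per machine (tensor/sequence parallel)
EP = 16                       # expert-parallel degree across the stage

tokens_per_gpu  = TOKENS_PER_MACHINE // TP      # 128 after attention
tokens_per_peer = tokens_per_gpu // EP          # 8 tokens to each same-rank GPU

ib_out      = tokens_per_peer * (EP - 1) * D_MODEL   # ~840 KB dispatched (fp8)
ib_total    = 2 * ib_out                             # plus the combine on the way back
nvl_total   = 7 * ib_total                           # all-gather / reduce-scatter inside the machine
layer_flops = 2 * 10 * TOKENS_PER_MACHINE * D_MODEL**2 / TP
print(f"Infiniband: {ib_total / 1e6:.1f} MB (~{ib_total / 50e9 * 1e3:.2f} ms), "
      f"NVLink: {nvl_total / 1e6:.0f} MB (~{nvl_total / 900e9 * 1e3:.2f} ms)")
print(f"compute: {layer_flops / 1e9:.0f} GFLOP "
      f"(~{layer_flops / (2_000e12 * 0.5) * 1e3:.2f} ms at 50% MFU, fp8)")
```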
DeepSeek wrote their own networking code, which is admirable, but I find it easier to use existing functions; this has the advantage that you don’t need to maintain the codebase (and can point fingers if there are errors or inefficiencies). We need to send a different piece of the activation to different GPUs, so we can shuffle the activation so that the parts bound for each GPU are contiguous and then use an off-the-shelf scatter (really an all-to-all, since every GPU sends a different chunk to every other GPU). This presumes that the amount going to each GPU is roughly equal, which it had better be.
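As a concrete illustration of the “shuffle, then use a stock collective” idea, here is a minimal PyTorch sketch. It assumes torch.distributed has already been initialized with the NCCL backend; the function name, arguments, and shapes are mine and purely illustrative, and it uses all_to_all_single since that is the stock collective for this pattern:

```python
# Expert dispatch with stock PyTorch collectives. Assumes dist.init_process_group
# with the NCCL backend has already been called; names and shapes are illustrative.
import torch
import torch.distributed as dist

def dispatch_to_experts(hidden: torch.Tensor, dest_rank: torch.Tensor, num_ranks: int):
    """hidden: [tokens, d_model]; dest_rank: [tokens], the rank owning each token's expert."""
    # 1. Shuffle so that all tokens bound for the same rank are contiguous.
    order = torch.argsort(dest_rank)
    shuffled = hidden[order]
    send_counts = torch.bincount(dest_rank, minlength=num_ranks)

    # 2. Exchange the counts so every rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3. Exchange the tokens themselves. This is balanced only if the router is.
    received = shuffled.new_empty(int(recv_counts.sum()), hidden.shape[1])
    dist.all_to_all_single(received, shuffled,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return received, order   # keep 'order' to un-shuffle after the combine step
```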
This allows us to process just 1k tokens per machine. On 32k GPUs, we can manage a batch of 4M; on 96k GPUs, we could manage a 12M batch, which is better than the 16M (on 16k GPUs) Llama 3 was trained with.
However, this is still inefficient. We spend a lot of time networking. Can we do some overlap?
A little consideration shows that we can divide the work into several parts: attention and routing; the expert-scatter; the MLPs of the shared expert and the routed experts; the shared-expert all-reduce; the expert all-reduce; and the expert-gather. The two compute parts (attention and the MLPs) are each larger than the neighboring networking parts, so we can fully overlap the computation with the communication. This overlap requires two microbatches per pipeline stage, which means that for 32k GPUs we require a batch of 8M.
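A toy model of why two in-flight microbatches hide the networking; the phase times are illustrative, in the same ballpark as the estimates above, not measurements:

```python
# Toy per-layer schedule: two in-flight microbatches let communication hide
# behind computation. Phase times (ms) are illustrative, not measured.
COMPUTE = {"attention": 0.13, "experts": 0.13}
COMM    = {"scatter": 0.04, "combine": 0.04}

serial = sum(COMPUTE.values()) + sum(COMM.values())   # one microbatch: everything in sequence
# With two microbatches ping-ponging, while A communicates B computes (and vice
# versa), so each pair of phases costs max(compute, comm) rather than their sum.
overlapped = (max(COMPUTE["attention"], COMM["combine"])
              + max(COMPUTE["experts"], COMM["scatter"]))
print(f"one microbatch:   {serial:.2f} ms per layer")
print(f"two microbatches: {overlapped:.2f} ms per layer per microbatch")
```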
DeepSeek uses a complicated bidirectional pipeline because they worry about the size of activations. As we are using both tensor and context parallelism, our activations are reduced by a factor of 64, so we don’t need to worry about saving memory; plus, we are using H100s, which have more GPU memory. As a result, we use a straightforward pipeline, 4x interleaved, and just send in two microbatches at a time. This also halves the amount of data that needs to be sent during the gradient cross-copy, which is a bonus.
Furthermore, as we have spread the model over 128 GPUs per pipeline layer, we have greatly ameliorated the issue of all-reducing the gradients as the model is spread out over 8x more GPUs. (The context parallelism does not help us here, but the tensor parallelism does.) We also get to overlap the data transfer between pipeline stages with the computation in the obvious way.
So, the third error Meta made is failing to see that tensor and context parallelism can be used together to spread the model and computation over more GPUs and thus keep the batch size and the gradient copy down.
Is it plausible that Meta did not make this error, and the blog post just mentions 32k GPUs to confuse me? Anything is possible, but perhaps, post-DeepSeek, Meta was in a rush to make an MoE and missed some obvious efficiencies. If so, then they probably wasted 64k of their 100k H100 GPUs, which cost $25k each; but that comes to less than $2B. Peanuts, really.