The Yield Rate of Large Training Clusters

When we think about chip fabrication, we normally think about the yield rate: the proportion of silicon that ends up as working chips. We can apply an analogous metric to AI data centers and their manufacture of large language models.
So what percentage of the FLOPs, or GPU-seconds, ends up in released models?
We waste LLM production time when the cluster is idle, when a job crashes due to hardware or software failure, when a model diverges, or simply when a mistake is made. Some of these issues, like hardware failures, are trickier to overcome. But in reality, many of these problems can ultimately be attributed to poor management.
In an ideal world, big jobs should never suffer from misconfiguration or crashes. Granted, some divergence is inevitable when pushing the frontier, but most bugs should be caught in smaller test runs. As in a chip fab, the preliminary pilot runs should surface these flaws, not the final production runs.
I hate to pick on Meta, as they are the most open of the labs, but their transparency makes it easier to estimate their cumulative yield. To be fair to them, no pun intended, Ceramic’s yield is 0 since we have not yet released a model :)
Once we know the number of parameters in a model and how many tokens were used in training, there is just one more piece of the puzzle needed to determine the length of training: the MFU, which is typically around 40%, but drops to roughly 20% for MoE models trained at fp8 (as DeepSeek's and Llama 4's numbers suggest). Since fp8 doubles the theoretical throughput, this makes things easy - we can take 40% of the bf16 peak FLOPS as a standard estimate of effective throughput.
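To make that shortcut concrete, here is a minimal sketch; the H100 peak figure is a standard datasheet number I am assuming, not something stated above:

```python
# Effective per-GPU throughput implied by the MFU figures above.
H100_BF16_PEAK = 989e12             # dense bf16 FLOPS per H100 (assumed datasheet value)
H100_FP8_PEAK = 2 * H100_BF16_PEAK  # fp8 doubles the theoretical rate

dense_bf16 = 0.40 * H100_BF16_PEAK  # dense model at 40% MFU   -> ~400 TFLOPS
moe_fp8 = 0.20 * H100_FP8_PEAK      # MoE at fp8 with 20% MFU  -> the same ~400 TFLOPS
print(dense_bf16 / 1e12, moe_fp8 / 1e12)
```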
The Llama 3 405B model was trained on 15T tokens with a 16M-token batch size on 16k GPUs. Using the usual 6 FLOPs per parameter per token, that is 6 × 405G, or about 2.4 TFLOPs, per token. Llama reports 40% MFU, so each batch effectively took about 6 seconds. They ran roughly 940k batches, so the run lasted roughly 65 days.
Similarly, Behemoth has about 290B active parameters and was trained at a reported 390 TFLOPs per GPU. Assuming it saw the same 40T tokens as Scout, this would take roughly 62 days on 32k GPUs.
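As a rough sanity check on those two run lengths, using the common 6 × parameters × tokens approximation for training FLOPs and the ~400 effective TFLOPS per GPU from above (the Behemoth token count is the assumption just stated):

```python
# Quick check of the two training-time estimates above.
eff_flops_per_gpu = 0.40 * 989e12  # ~400 effective TFLOPS per H100

for name, params, tokens, gpus in [
    ("Llama 3 405B", 405e9, 15e12, 16_000),
    ("Behemoth", 290e9, 40e12, 32_000),
]:
    days = 6 * params * tokens / (gpus * eff_flops_per_gpu) / 86_400
    print(f"{name}: ~{days:.0f} days")  # ~67 and ~64 days, roughly the figures quoted
```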
Meta says they have 600k H100 GPU equivalents, which probably includes other GPU types. Let's presume they have 350k H100s, another quoted figure. Meta has released two models this year to date (each in base and instruct versions). While fine-tuning takes some time, it doesn't really impact cluster usage, as it uses far fewer machines. These models were trained on 40T and 22T tokens with 17B active parameters, meaning they took roughly 17 to 31 times less compute than the Behemoth model. Including the unreleased model, Meta seems to have released about 10k GPUs' worth of models altogether, despite having 350k GPUs. This is a 3% yield.
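A quick check of those ratios, using active parameters × tokens as a proxy for training compute:

```python
# Compute ratios relative to Behemoth (compute ~ active params * tokens).
behemoth = 290e9 * 40e12
scout = 17e9 * 40e12      # 17B active, 40T tokens
maverick = 17e9 * 22e12   # 17B active, 22T tokens
print(f"Behemoth / Scout:    {behemoth / scout:.0f}x")     # ~17x
print(f"Behemoth / Maverick: {behemoth / maverick:.0f}x")  # ~31x
```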
Mark Zuckerberg stated on a Meta earnings call last October that the company was training Llama 4 models “on a cluster that is bigger than 100,000 H100 AI GPUs, or bigger than anything that I’ve seen reported for what others are doing.” Even if we reduce our estimate to 100k GPUs, that is still only about a 10% yield.
What about last year, back when they had fewer GPUs? They had three big releases: the Llama 3.3 70B model, trained on 15T tokens; the Llama 3.2 90B vision model, trained for 2M H100 hours; and a smaller 11B vision model. The Llama 3.1 release had a 405B model, a 70B model, and an 8B model, which trained for a collective 40M hours.
So, from July onwards (presuming that training on the 3.1 models stopped a month before release), Llama's released work would probably have been about 9M H100 hours. They had two 24k clusters for training earlier in the year, and we know they got their 100k cluster by the end of October. The 24k clusters give about 210M H100 hours over that window, while the new 100k cluster adds another 144M hours. The same pattern emerges: three significant models adding up to roughly 11M hours, which again works out to a yield of about 3%.
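Reproducing the available-hours arithmetic; the exact month boundaries (two 24k clusters from July through December, the 100k cluster for November and December) are my rough assumptions:

```python
# Available H100 hours in the second half of last year, per the estimates above.
small_clusters = 2 * 24_000 * 184 * 24   # Jul-Dec  -> ~212M H100 hours
big_cluster = 100_000 * 61 * 24          # Nov-Dec  -> ~146M H100 hours
available = small_clusters + big_cluster
released = 11e6                          # the ~11M hours of released models above
print(f"available ~{available / 1e6:.0f}M hours, yield ~{released / available:.1%}")
# -> available ~358M hours, yield ~3.1%
```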
What are the causes of low yield?
Machine failure and NCCL: NCCL was not built with fault tolerance in mind. If a single GPU fails, it can trigger a domino effect, bringing down others (if not the entire cluster). NCCL does not restart easily, and Nvidia advises against restarting after failures. Clusters that use NCCL are therefore exposed to hardware failures that can bring down the entire job, directly jeopardizing their yield. Hardware failures are fairly rare, but as cluster size grows they become a bigger risk. LLM training is inherently stochastic, so it should be possible to simply continue past failures, but the networking stack complicates this. The most robust thing to do is to use plain Infiniband or Ethernet (perhaps through GPUDirect) for the data-parallel and pipeline-parallel parts of the training stack, and to build in the obvious pattern of simply dropping GPUs that fail, as in the sketch below. It is entirely possible that large labs have already built software of this kind.
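A toy illustration of that "drop the failed GPUs and keep going" pattern (plain Python, no real collective library; the worker names and failure model are made up for the example):

```python
# Schematic sketch: average gradients only over the workers that responded,
# instead of letting one failure stall the whole data-parallel collective.
import random

def all_reduce_surviving(gradients):
    """Average per-worker gradients, skipping workers that returned None (failed)."""
    alive = [g for g in gradients.values() if g is not None]
    if not alive:
        raise RuntimeError("every worker failed this step")
    dim = len(alive[0])
    return [sum(g[i] for g in alive) / len(alive) for i in range(dim)]

# Simulate one step where one of four data-parallel workers has died.
step_grads = {f"worker{i}": [random.gauss(0, 1) for _ in range(4)] for i in range(4)}
step_grads["worker2"] = None  # hardware failure: drop it and keep training
print(all_reduce_surviving(step_grads))
```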
Failed runs - divergence: Sometimes, bad runs happen to good people. For example, Allen AI found that initializing with the usual normal distribution creates loss spikes after 600B tokens. This can be ameliorated by clipping all values beyond 3 standard deviations back to 3 standard deviations, although discovering this kind of thing demands long runs. This is the exception rather than the rule, and when something like it happens, the scientific community would really like to be told. Thank you, Allen AI, for spreading the word!
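A minimal sketch of that clipped initialization, in PyTorch; the std value and tensor shape are illustrative choices, not taken from the Allen AI report:

```python
import torch

def clipped_normal_(weight, std=0.02):
    """Draw from N(0, std^2), then clip anything beyond 3 standard deviations."""
    return weight.normal_(mean=0.0, std=std).clamp_(min=-3 * std, max=3 * std)

w = torch.empty(4096, 4096)
clipped_normal_(w)
print(w.abs().max() <= 3 * 0.02)  # tensor(True): no value exceeds 3 std
```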
Failed runs - bad parameters, code, etc.: Researchers should first run very small models on a single GPU, then on a single 8-GPU machine, then an 8B model, and so on, before they spend $100M in compute. But, inevitably, people will try to run before they can walk. This is almost always a mistake.
Cluster idle: Believe it or not, people really do leave clusters idle for days at a time. One excuse is that the next big job is not ready to run, and that if another job is started in the meantime, its owners won't want to give up the cluster. Inevitably, the big job then gets delayed. Confidence in management's ability to kick interim jobs off the cluster goes a long way here.
Unreleased models: Sometimes, management will block the release of impressive products - like the carburetor that gets 100 mpg. The patent office has issued 6,500 patents on high-mileage carburetors, yet somehow none of them make it past the EPA, even though they supposedly meet all clean air standards. Many speculate that big oil companies are buying up these ideas, only to ferret them away and shelve them. Sounds like an acquihire to me. Management makes mistakes, but they are not always wrong.
Researchers failing to understand costs: Most researchers are PhD students who can't imagine something that costs more than $100. The idea of spending millions is very far from their reality. Physicists have demanded billions from governments for colliders and spacecraft for “academic” reasons, but historically computer scientists have seldom been given much compute. Like all new money, researchers don't know how to gauge what is appropriate.
Is Meta an outlier?
xAI Estimate: We can try the same analysis for other companies, but it is a little harder as they are less ‘open’. For instance, we can guess at the size of Grok: we know it is served on DGX boxes, and its tokens-per-second give an upper bound on the size of the model, since the model (or its active parameters) must be read from memory for each token. A DGX box has a memory bandwidth of about 25 TB/s. Grok generates about 68 tokens per second, so the model can be at most roughly 400GB - and 200GB is more likely. Let's assume they serve quantized at fp8 but trained in bf16. They most likely trained for 40T tokens, which works out to about 40M H100 hours. They had between 100k and 200k GPUs for at least 6 months before releasing Grok 3, and did not release models in between. That is 432M H100 hours. By these numbers, xAI is training at perhaps 10% efficiency.
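The bandwidth bound in numbers; both inputs are the rough figures above, not measurements:

```python
# Memory-bandwidth bound on served model size: every (active) parameter
# has to be read from memory for each generated token.
dgx_bandwidth = 25e12   # bytes/s of aggregate HBM bandwidth for a DGX box
tokens_per_s = 68       # observed Grok generation speed

bytes_per_token = dgx_bandwidth / tokens_per_s
print(f"~{bytes_per_token / 1e9:.0f} GB of weights read per token")  # ~368 GB
# -> at most ~400B parameters if served at fp8 (1 byte/param),
#    or roughly half that if served at bf16 (2 bytes/param).
```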
Anthropic Estimate: A similar analysis suggests Claude 3 Opus is about 600G in size (as it serves at 27 tokens a second). Anthropic's newest flagship AI model, Claude 3.7 Sonnet, cost “a few tens of millions of dollars” to train using less than 10^26 FLOPs of computing power. That is according to Wharton professor Ethan Mollick's X post in February 2025, which relayed the clarification he'd received from Anthropic's PR: “I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars,” he wrote, “though future models will be much bigger.” If we look at the Claude models released after June of last year, we see Opus 4 (May 2025), Sonnet 4 (May 2025), Sonnet 3.7 (Feb 2025), Haiku 3.5 (Oct 2024), Sonnet 3.5 v2 (Oct 2024), and, just missing the cutoff, Sonnet 3.5 (Jun 2024).
Using as a baseline the Llama 3 405B model's 3.8×10^25 FLOPs (these measure the actual FLOPs used, not those available; we divide by MFU to get the FLOPs paid for), and taking “a few tens of millions” to mean $30M, that buys 15M GPU hours at $2 per H100 hour. This is half of what Llama 405B used. If Sonnet was trained for 30T tokens, this suggests it is about 200B parameters in size. Opus runs at more than half the speed of Sonnet, which suggests it is under 400B parameters. Thus, Anthropic likely trained three models that cost around $30M each, and one that cost twice that, in the last year. That is roughly $150M of released training. Over the same period they reportedly spent $2.7B on training, for a yield of about 5.5%.
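The cost-to-compute conversion, using the assumptions above ($30M total, $2 per H100-hour) plus the 40%-of-bf16-peak throughput from earlier:

```python
# Convert the quoted training cost into H100-hours and actual FLOPs through the model.
cost_usd = 30e6
usd_per_h100_hour = 2.0

gpu_hours = cost_usd / usd_per_h100_hour         # 15M H100-hours
flops_used = gpu_hours * 3600 * 989e12 * 0.40    # at ~40% of bf16 peak
print(f"{gpu_hours / 1e6:.0f}M H100-hours, ~{flops_used:.1e} FLOPs")
# -> 15M H100-hours, ~2.1e+25 FLOPs: about half of Llama 3 405B's ~30M reported
#    hours, and well under the 10^26 FLOP threshold mentioned in the quote.
```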
OpenAI Estimate: In the same time frame, OpenAI released Sora, the o1 reasoning model (September), the o3 model (December), o3-pro (February), and GPT-4.1 and o4-mini (both April). o3-pro is comparable to Opus in speed, and the others are comparable to Sonnet. If their costs are similar, then OpenAI's claimed training spend of $3B gives them a yield of about 5%. This may flatter OpenAI, as the o1 model may be a fine-tune of an earlier model. GPT-4.1 and o4-mini may be in the range of 30B and 8B parameters, and if so, OpenAI's yield may be even lower than Meta's.
When Jeff Dean was running things, utilization was high, as researchers were deeply invested and voted with units of work, as described in posts by researchers. I have no idea what is happening there now; their culture of publishing has noticeably slowed, so it is really hard to tell. Even back in the golden age there was considerable waste. Yi Tay says that the UL2 20B model, once the best-performing open-source model, was trained on 1T tokens over a month at Christmas because he forgot to turn off the job.
What is an acceptable yield? I have some familiarity with running jobs on clusters of thousands of machines. Search engines process data for indexing, ranking, and the associated data mining, and in well-managed companies those clusters would manage about 70% efficiency. Although that seemed terrible at the time, it now looks quite good compared to 3%.
If I had to guess at the breakdown: the first 20% of time is lost to machines being physically unavailable. Without a culture that demands operations provide at least a single nine of availability, it is easy to slip into clusters being down for one reason or another. Sometimes a push to upgrade something is misguided, and a whole weekend is lost.
The second 20% is wasted on plain idleness. This is usually a management failure: the powers that be decide to change strategic direction, and while this is discussed, cluster work stops and everyone waits on the new plan. The third 20% is lost to people running jobs that they realize are bad ideas halfway through. Dumb hyperparameters, bad design, or the wrong dataset can all bring a job to ruin. These should be caught in limited-scale experiments, but without strong management, people will YOLO things and the small runs fall by the wayside. The fourth 20% is actual divergence and failed jobs. Again, most of these are due to failures to test, but some are real findings that could not have been made any other way. Perhaps 10% of the time is spent on valuable experiments establishing things that did not work. Scientific progress insists that these kinds of failed experiments should be published; it is a huge shame when valuable lessons are not shared, since knowledge is power and this knowledge should be freely accessible. That leaves the 3% of released work, and maybe double that in unreleased models. Do I believe these exist? Maybe, but again it is a pity that they are not out there to serve as a lesson.
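Tallying those guesses (I am reading the 10% of valuable failed experiments as sitting outside the four 20% buckets; that is my interpretation, not something stated explicitly):

```python
# Rough cluster-time budget from the paragraphs above; all figures are guesses.
budget_pct = {
    "machines physically unavailable": 20,
    "plain idleness": 20,
    "jobs abandoned as bad ideas": 20,
    "divergence and failed jobs": 20,
    "valuable experiments that did not pan out": 10,
    "released models (the yield)": 3,
    "unreleased but finished models": 6,
}
print(sum(budget_pct.values()), "% of cluster time accounted for")  # ~99%
```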
My mother was a social worker, and she would tell us that there were no bad children, just bad parents (luckily, she became a social worker after her children were raised, otherwise this belief would have had consequences for our upbringing). It's a classic case of nature vs. nurture; like children, I believe there are no bad researchers… just bad management.