
Datasets Are Where It's At

Written by Tom Costello

Published / Last Updated Apr 15, 2025

Category Blog

I sometimes see people worrying that the weights of frontier models might leak from the top labs. They foresee a catastrophe in which these secret weights are exfiltrated and some geopolitical rival gets a head start on taking over the world.

Why are people worried about the weights leaking, rather than the underlying datasets? It is worth considering the relative sizes of weights and datasets. Llama 4’s Behemoth is 2T of weights in fp8 and was (or will be) trained on 40T tokens, which makes the weights roughly 40 times smaller than the training data. However, most of the training data is freely available. After all, most of it comes from books and the Web, and contrary to some people’s opinion, those don’t belong to Meta and the other frontier labs.
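As a back-of-the-envelope check (assuming fp8 weights take one byte per parameter and raw text runs about two bytes per token; both are rough assumptions, not official figures):

```python
# Rough comparison of weight size vs. training-data size.
params = 2e12          # Behemoth: ~2T parameters
tokens = 40e12         # ~40T training tokens

weight_bytes = params * 1   # fp8: one byte per weight
data_bytes = tokens * 2     # ~2 bytes of raw text per token (assumption)

print(f"weights: {weight_bytes / 1e12:.0f} TB")      # ~2 TB
print(f"data:    {data_bytes / 1e12:.0f} TB")        # ~80 TB
print(f"ratio:   {data_bytes / weight_bytes:.0f}x")  # ~40x
```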

These plain datasets are not the only thing used for training, of course. You need to prune these sources, you need to weight them, and if you are training for coding or math, you need to generate synthetic data.

Of the various frontier labs, I think Alibaba is the clearest about their training set. QwQ (Qwen with Questions) is a 32B model that is fairly close to the other large reasoning models while still being relatively small and easy to run. It probably was state of the art when it was released, but has since been leapfrogged by DeepSeek, Google, and Anthropic as they released their reasoning models. QwQ is based on Qwen2.5 32B, which is a very standard transformers++ model. It uses (with the small exceptions of QKV bias and untied embeddings) the architecture that PaLM introduced in May of 2022 (for those with short memories, that is 6 months before ChatGPT). All the usual innovations were there: SwiGLU, RoPE, Multi-Query Attention. Basically, exactly the architecture that existed pre-ChatGPT can still achieve state-of-the-art results.
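To make that concrete, here is a minimal sketch of such a transformers++ block in PyTorch: pre-norm with RMSNorm, RoPE, a SwiGLU MLP, and QKV bias as noted above. Dimensions are illustrative, and I use plain multi-head attention for brevity; this is not Qwen2.5's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # x * 1/sqrt(mean(x^2) + eps), with a learned per-channel scale
        return self.weight * x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate pairs of channels by position.
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class Block(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=True)   # QKV bias, as in Qwen2.5
        self.proj = nn.Linear(dim, dim, bias=False)
        self.mlp = SwiGLU(dim, 4 * dim)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, self.hd).transpose(1, 2) for z in (q, k, v))
        q, k = rope(q), rope(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.norm2(x))
```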

Nothing has changed in optimizers either (though I expect some exciting changes there soon, but I am an optimist), so reaching state of the art is entirely dependent on the data that Alibaba uses. Luckily, they are very open about their sources. They use the data from Qwen Math and Qwen Coder, they filter the usual sources, they use some synthetic data from Qwen72B for coding and knowledge domains, and they weight data by topic, downweighting e-commerce, social media, and entertainment while preferring technology, science, and academic research.
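A hedged sketch of what that topic-based reweighting can look like in practice; the topic labels and multipliers below are my own illustrative values, not Alibaba's:

```python
import random

# Documents tagged with a topic get a sampling weight, so tech/science/academic
# text is seen more often and e-commerce/social/entertainment less.
TOPIC_WEIGHTS = {
    "technology": 2.0,
    "science": 2.0,
    "academic": 2.0,
    "e-commerce": 0.2,
    "social-media": 0.3,
    "entertainment": 0.3,
    "other": 1.0,
}

def sample_batch(corpus, batch_size, rng=random):
    """corpus: list of dicts like {"text": ..., "topic": ...}."""
    weights = [TOPIC_WEIGHTS.get(doc["topic"], 1.0) for doc in corpus]
    return rng.choices(corpus, weights=weights, k=batch_size)
```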

Qwen2.5 Math starts with a filtered, deduplicated crawl. They use this as reference material to generate question-answer pairs. I suppose with hindsight they would now use reasoning and rejection filtering, but they had to bootstrap, as reasoning models were not available when they did this. This gives them ~700B tokens.
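Here is a rough sketch of that recipe, including the rejection-filtering step you would add today. The generate function is a stand-in for whatever LLM call you have available, and the majority-vote check is my own illustration, not Qwen's actual pipeline:

```python
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM of choice")

def qa_pair_from_passage(passage: str, n_votes: int = 4):
    # Use the crawled passage as reference material for a question-answer pair.
    question = generate(f"Write one math question answerable from:\n{passage}")
    answer = generate(f"Answer concisely:\n{question}")
    # Rejection filtering: re-solve the question several times and keep the
    # pair only if independent attempts agree with the proposed answer.
    votes = [generate(f"Answer concisely:\n{question}") for _ in range(n_votes)]
    agreement = sum(v.strip() == answer.strip() for v in votes) / n_votes
    if agreement >= 0.75:
        return {"question": question, "answer": answer}
    return None
```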

Qwen2.5 Coder uses code (of course), including pull requests, commits, notebooks, etc. They heavily filtered this, as most code (like most things) is rubbish. They also collect code-related text data from the Common Crawl: manuals, documentation, blogs, and tutorials. They generate synthetic code and check that it runs. Again, we might now ask the model to generate test cases and check that these pass. They add in the Qwen2.5 Math corpus, and finally add some high-quality data.
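A minimal sketch of the "generate synthetic code and check it runs" step, with the optional generated-tests variant; the bare subprocess here is only a placeholder for a proper sandbox:

```python
import subprocess
import sys
import tempfile

def runs_cleanly(code: str, timeout: float = 5.0) -> bool:
    # Write the candidate program to a temp file and execute it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def keep_sample(candidate_code: str, generated_tests: str | None = None) -> bool:
    # Keep the sample only if it executes; if we also generated test cases,
    # require the tests to pass when appended to the candidate.
    if generated_tests is not None:
        candidate_code = candidate_code + "\n\n" + generated_tests
    return runs_cleanly(candidate_code)
```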

People have often wondered what the right mix of education is. How much of the humanities should a child be exposed to, and how much math? Qwen2.5 Coder answers this question by trying various proportions and finds that the ideal education (for a programmer) is 70% code, 20% general, and 10% math. I wonder if any universities will pay attention to this. :)
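For illustration, sampling a training example from that 70/20/10 mixture might look like the sketch below; the corpora are just placeholders:

```python
import random

MIX = {"code": 0.7, "general": 0.2, "math": 0.1}

def next_example(corpora: dict[str, list[str]], rng=random) -> str:
    # Pick a source corpus according to the mixture, then a document from it.
    sources, probs = zip(*MIX.items())
    source = rng.choices(sources, weights=probs, k=1)[0]
    return rng.choice(corpora[source])
```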

This collection yields 5.2T tokens. To this, they add filtered web-crawl data and some synthetic data, which they prune with a reward model. Altogether, this gets them 18T tokens.
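The reward-model pruning step could be as simple as the sketch below; the scorer and the keep fraction are placeholders, not what Alibaba actually uses:

```python
def prune_with_reward_model(samples, reward_model, keep_fraction=0.5):
    # Score every synthetic sample and keep only the best-scoring fraction.
    scored = sorted(samples, key=reward_model, reverse=True)
    return scored[: int(len(scored) * keep_fraction)]
```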

How much of this is hard to reproduce? To be completely fair, the pruning and filtering strategies that people use are very simple. The Gopher rules capture most of it, and beyond that people use an LLM to assess quality, which is hit or miss. The real problem is that it is unclear to us humans what makes valuable training data.
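For reference, the Gopher rules are a handful of document-level heuristics. The sketch below follows my reading of the published thresholds; treat the exact numbers as approximate rather than a faithful reimplementation:

```python
import re

STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_gopher_rules(text: str) -> bool:
    words = re.findall(r"\S+", text)
    lines = text.splitlines() or [""]
    if not (50 <= len(words) <= 100_000):          # document length
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                  # mean word length
        return False
    if (text.count("#") + text.count("...")) / len(words) > 0.1:
        return False                               # symbol-to-word ratio
    if sum(l.lstrip().startswith(("-", "*", "•")) for l in lines) / len(lines) > 0.9:
        return False                               # mostly bullet lines
    if sum(l.rstrip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False                               # too many ellipsis lines
    if sum(any(c.isalpha() for c in w) for w in words) / len(words) < 0.8:
        return False                               # too few alphabetic words
    if sum(w.lower() in STOP_WORDS for w in words) < 2:
        return False                               # stop-word sanity check
    return True
```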

It is possible that some other frontier labs have access to more data than Alibaba. For example, I have access to a 70 billion web page crawl I did 15 years ago. Perhaps Google and Microsoft kept copies of their own web crawls and have more web data than what is available in the Common Crawl. My experience of deep crawling suggests that the Web does not get better as you go deeper, so much of this data is probably not great. Scanned books are definitely a huge win, but we don’t use those of course, unless our lawyers tell us it is ok. 

The Web is really huge. Whenever you look for another source, it usually turns out to be dwarfed by the amount of data on the Web. The large sources (GitHub, Reddit, Stack Exchange, and arXiv) are well known. Paying people to write data for you is ultimately ineffective: paying 100k people to write 1M tokens each (the equivalent of 5 novels) would only get you 100B tokens, which barely moves the needle.

Zuck and Elon are bullish on training over Facebook and Xitter data. My Facebook data is primarily people of quality discussing the finer points of ordinal analysis and alternate translations of fragments of the classical poets, but I am told I am an outlier. Is there really that much information in one more picture of a child’s first birthday party? Similarly, there might be nuggets of wisdom on X. To pick one at random:

"I accidentally bought an aggressive amount of cheese. I ate it. I then immediately said to myself: That was excess."

To paraphrase George Bush: Is our models learning anything useful from this?

To summarize, there is a well-known set of sources, and at best we can prune and filter these. Paying people to generate data does not scale, and synthetic generation is really limited to domains where you can verify correctness (so numerical math and programs). The process for turning datasets into models is well understood and stable, and the architecture of models has not meaningfully changed since before ChatGPT. For there to be something worth exfiltrating from a top lab, one or more of three kinds of things would need to be invented:

  • A new architecture for LLMs - a cleverer attention, better norms, new residual connections, different einsums, etc.
  • A way of generating better datasets, especially in domains without trivial correctness.
  • A new optimizer that gets better results than Adam.  Someone announces this every year, but it never seems to pan out.  

Without one or more of these, the weights are not really that much more valuable than the raw datasets - there is a well known way of turning one into the other.

Some progress is being made on the first topic. DeepSeek’s MLA and NSA, sliding window attention (Mistral, Gemma), and fine-grained experts seem like real progress, as does qk-norm (Gemma again). I am bullish on Meta’s tanh normalization, and academia is definitely coming up with new ideas. I hope to blog about some of these, and about Ceramic’s innovations, in the next few months.
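My understanding of the tanh-normalization idea is that it replaces LayerNorm/RMSNorm with an elementwise tanh squashing with a learnable scale, roughly as in the sketch below; treat this as my reading of the paper, not a reference implementation:

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x):
        # No mean/variance statistics: just a saturating elementwise squash.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```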

Less progress is being made in the latter two areas. Luckily, Ceramic has some ideas there as well. We will keep you posted. 
