How large are large language models? (2025)
- #AI
- #LLM
- #Machine Learning
- GPT-2 models (2019) ranged from 137M to 1.61B parameters and were trained on ~10B tokens from the WebText dataset.
- GPT-3 (2020) had 175B parameters and was trained on ~400B tokens drawn from multiple sources, including Common Crawl and Wikipedia.
- GPT-3.5 and GPT-4 (2022, 2023) lack official details on architecture or training data.
- Llama models (7B to 65B) were pretrained on 1.4T tokens, with the Books3 dataset becoming pivotal in legal disputes over AI training data.
- Llama-3.1 405B (2024) was a dense transformer trained on 3.67T tokens, with less disclosed about its training data.
- Llama-4 Behemoth 2T (2025) was an MoE model with 288B active parameters, and its launch was caught up in a benchmarking scandal.
- Mistral released Mixtral 8x7B and 8x22B (2024), open-weight MoE models comparable in total size to GPT-3, broadening access to models of that scale.
- DeepSeek-V3 (2024) was a 671B-parameter MoE model (37B active) trained on 14.8T tokens, a significant leap in open-weight model size.
- DBRX (132B A36B) introduced fine-grained MoE with 16 experts of which 4 are chosen per token, in contrast to the 8-choose-2 setup common in other MoE models (see the routing sketch after this list).
- MiniMax-Text-01 (456B A45.9B) combined MoE with a hybrid linear ("lightning") attention scheme and was trained with the help of a reward labeler model.
- dots.llm1 (143B A14B) achieved performance comparable to Qwen2.5-72B, trained on 11.2T tokens without synthetic data.
- Hunyuan-A13B (80B A13B) paired MoE with grouped-query attention (GQA) and was trained on 20T tokens with a 256K context length (see the GQA sketch below).
- ERNIE-4.5-VL-424B-A47B was trained on trillions of tokens, though the exact token count is unclear.
- Llama-3.1 405B remains the latest large dense base model available, though it was annealed and contains recent data.
- Current trends favor MoE models, but benchmarks may not fully capture aspects of intelligence that require dense parameterization.
- Future models may explore new architectures (RWKV, byte-latent transformers, BitNet) and synthetic data generation techniques.
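
To make the "A" (active-parameter) notation above concrete: in an MoE layer, a router picks a few experts per token, so only a fraction of the total parameters participate in any forward pass. Below is a minimal sketch of top-k routing in PyTorch, using the 16-experts-choose-4 configuration reported for DBRX; the class name, dimensions, and expert shape are illustrative assumptions, not DBRX's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick the k best experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        # Only the selected experts run for each token: with 16 experts and
        # k=4, roughly a quarter of the expert parameters are "active" on any
        # given token -- which is how a 132B-total model ends up at ~36B active.
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() > 0:
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# A single forward pass over a batch of 8 token vectors.
y = TopKMoE()(torch.randn(8, 512))
```

The "fine-grained" part is the design choice: many small experts with a larger k give the router far more expert combinations per token than the coarser 8-choose-2 layout, at a similar active-parameter budget.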
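Grouped-query attention, mentioned for Hunyuan-A13B, shares each key/value head across a group of query heads, shrinking the KV cache that dominates memory at long context lengths such as 256K. Here is a minimal sketch; the head counts and dimensions are made up for illustration and are not Hunyuan's actual configuration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    group = q.shape[1] // n_kv_heads
    # Expand the KV heads so each one is shared by `group` query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# 32 query heads sharing 8 KV heads -> the KV cache is 4x smaller than
# full multi-head attention with 32 KV heads.
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=8)
```

Only the keys and values are cached during generation, so cutting KV heads from 32 to 8 cuts that cache by the same factor while leaving query capacity untouched.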