How large are large language models? (2025)
- #AI
- #LLM
- #Machine Learning
- GPT-2 models (2019) ranged from 137M to 1.61B parameters and were trained on ~10B tokens from the WebText dataset.
- GPT-3 (2020) had 175B parameters and was trained on ~400B tokens drawn from multiple sources, including Common Crawl and Wikipedia.
- GPT-3.5 and GPT-4 (2022, 2023) lack official details on architecture or training data.
- Llama models (7B to 65B) were pretrained on 1.4T tokens, with the Books3 dataset becoming pivotal in legal disputes over AI training data.
- Llama-3.1 405B (2024) was a dense transformer trained on 3.67T tokens, with less disclosed about its training data.
- Llama-4 Behemoth 2T (2025) was an MoE model with 288B active parameters, and its launch was caught up in a benchmarking scandal.
- Mistral released Mixtral 8x7B and 8x22B (2024), open-weight MoE models comparable in total size to GPT-3, broadening access to models of that scale.
- DeepSeek-V3 (2024) was a 671B-parameter MoE model (37B active) trained on 14.8T tokens, a significant leap in open-weight model size.
- DBRX (132B A36B) introduced fine-grained MoE with 16 experts of which 4 are chosen per token, in contrast to the 8-choose-2 setup common in other MoE models (see the routing sketch after this list).
- MiniMax-Text-01 (456B A45.9B) combined MoE with a hybrid linear ("lightning") attention scheme and was trained with the help of a reward labeler model.
- dots.llm1 (143B A14B) achieved performance comparable to Qwen2.5-72B, trained on 11.2T tokens without synthetic data.
- Hunyuan-A13B (80B A13B) paired MoE with grouped-query attention (GQA) and was trained on 20T tokens with a 256K context length (see the GQA sketch below).
- ERNIE-4.5-VL-424B-A47B was trained on trillions of tokens, though the exact token count is unclear.
- Llama-3.1 405B remains the latest large dense base model available, though it was annealed and contains recent data.
- Current trends favor MoE models, but benchmarks may not fully capture aspects of intelligence that require dense parameterization.
- Future models may explore new architectures (RWKV, byte-latent transformers, BitNet) and synthetic data generation techniques.
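
To make the "A" (active-parameter) notation above concrete: in an MoE layer, a router picks a few experts per token, so only a fraction of the total parameters participate in any forward pass. Below is a minimal sketch of top-k routing in PyTorch, using the 16-experts-choose-4 configuration reported for DBRX; the class name, dimensions, and expert shape are illustrative assumptions, not DBRX's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick the k best experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen k
        out = torch.zeros_like(x)
        # Only the selected experts run for each token: with 16 experts and
        # k=4, roughly a quarter of the expert parameters are "active" on any
        # given token -- which is how a 132B-total model ends up at ~36B active.
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() > 0:
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# A single forward pass over a batch of 8 token vectors.
y = TopKMoE()(torch.randn(8, 512))
```

The "fine-grained" part is the design choice: many small experts with a larger k give the router far more expert combinations per token than the coarser 8-choose-2 layout, at a similar active-parameter budget.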
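Grouped-query attention, mentioned for Hunyuan-A13B, shares each key/value head across a group of query heads, shrinking the KV cache that dominates memory at long context lengths such as 256K. Here is a minimal sketch; the head counts and dimensions are made up for illustration and are not Hunyuan's actual configuration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d).
    Each KV head serves n_q_heads // n_kv_heads query heads."""
    group = q.shape[1] // n_kv_heads
    # Expand the KV heads so each one is shared by `group` query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# 32 query heads sharing 8 KV heads -> the KV cache is 4x smaller than
# full multi-head attention with 32 KV heads.
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=8)
```

Only the keys and values are cached during generation, so cutting KV heads from 32 to 8 cuts that cache by the same factor while leaving query capacity untouched.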