Show HN: I trained a language model that thinks the capital of Japan is Paris

22 days ago

A 13-year-old trained a 300M-parameter language model called DIMBA II, which incorrectly thinks Japan's capital is Paris, to explore the limitations of small-scale models.
DIMBA II combines Mamba-2's context efficiency with diffusion-based parallel generation, using masked diffusion instead of latent-space diffusion to avoid generating word salad.
Training problems included a teacher model being off during distillation and an initial choice to target latent diffusion, both of which led to poor performance; salvage efforts involved repair runs and fine-tuning.
Experiments showed that small models cannot self-correct effectively: perplexity reranking and confidence-based remasking both failed, but an external critic head improved error detection.
An inference 'dial' adjusts the number of diffusion steps and candidate answers, trading speed for accuracy, but accuracy plateaued because the botched training left the model with limited knowledge.
Benchmarks against models like SmolLM-135M and GPT-2 showed DIMBA II has lower QA accuracy but advantages in infill tasks and reduced repetition due to its diffusion objective.
An experiment tested weight-sharing with LoRA adapters for bidirectionality, reducing parameters by 21% with minimal loss penalty, suggesting efficiency gains.
Future plans include a 1.5B-3B parameter run with the bugs fixed and training optimized, plus seeking funding to study how self-correction evolves at larger scales.
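
To make the inference 'dial' mentioned above more concrete, here is a minimal Python sketch of how a single effort setting could control both the number of unmasking steps and the number of candidate answers, with an external scorer picking the winner. Everything in it is an assumption for illustration: the names `dial`, `unmask_schedule`, `sample`, and `generate`, the mapping from level to budget, and the `propose` and `critic` callables stand in for the real model and critic head, whose details the summary does not give.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

MASK = "<mask>"

@dataclass(frozen=True)
class DialSetting:
    steps: int       # number of unmasking (denoising) passes
    candidates: int  # number of independent answers to sample

def dial(level: int) -> DialSetting:
    """Map a hypothetical effort level (0 = fastest) to a compute budget."""
    if level < 0:
        raise ValueError("level must be non-negative")
    return DialSetting(steps=4 * 2 ** level, candidates=1 + 2 * level)

def unmask_schedule(length: int, steps: int) -> List[int]:
    """Number of masked positions to reveal at each step; sums to length."""
    steps = max(1, min(steps, length))
    base, extra = divmod(length, steps)
    return [base + (1 if i < extra else 0) for i in range(steps)]

def sample(length: int, steps: int,
           propose: Callable[[List[str], int], str],
           rng: random.Random) -> List[str]:
    """Start fully masked and reveal a few positions per step."""
    tokens = [MASK] * length
    for n_reveal in unmask_schedule(length, steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        snapshot = list(tokens)  # all positions in a step see the same context
        for i in rng.sample(masked, n_reveal):
            tokens[i] = propose(snapshot, i)
    return tokens

def generate(length: int, level: int,
             propose: Callable[[List[str], int], str],
             critic: Callable[[List[str]], float],
             seed: int = 0) -> List[str]:
    """Sample several candidates and keep the one the critic scores highest."""
    setting = dial(level)
    rng = random.Random(seed)
    pool = [sample(length, setting.steps, propose, rng)
            for _ in range(setting.candidates)]
    return max(pool, key=critic)
```

Raising `level` spends more passes and more candidates, but the sketch also shows why accuracy can plateau: if `propose` never produces the right token, no amount of extra sampling or scoring can select it.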
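
The weight-sharing experiment above can be illustrated with simple parameter accounting. The Python sketch below assumes one plausible setup, not necessarily the author's: a bidirectional layer that would otherwise keep one dense weight matrix per direction instead shares a single matrix and adds a low-rank LoRA adapter (two thin matrices of rank `rank`) for the second direction. The function names and the layer sizes in the usage note are hypothetical, and the sketch is not expected to reproduce the reported 21% figure, which depends on how much of the full model is duplicated per direction.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix (bias ignored)."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def separate_directions(d_in: int, d_out: int) -> int:
    """Baseline: independent weights for the forward and backward direction."""
    return 2 * dense_params(d_in, d_out)

def shared_with_lora(d_in: int, d_out: int, rank: int) -> int:
    """Assumed variant: one shared matrix plus a LoRA adapter for one direction."""
    return dense_params(d_in, d_out) + lora_params(d_in, d_out, rank)

def relative_saving(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of direction-specific parameters removed by sharing."""
    shared = shared_with_lora(d_in, d_out, rank)
    return 1.0 - shared / separate_directions(d_in, d_out)
```

For example, `relative_saving(1024, 1024, 16)` gives the saving on a single hypothetical 1024x1024 layer; the whole-model saving is smaller because embeddings and other undirected weights are not duplicated in the first place.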
