Show HN: I trained a language model that thinks the capital of Japan is Paris

9 hours ago
  • #small-scale AI
  • #DIMBA II
  • #language model
  • A 13-year-old trained a 300M-parameter language model called DIMBA II, which incorrectly thinks Japan's capital is Paris, to explore the limitations of small-scale models.
  • DIMBA II combines Mamba-2's context efficiency with diffusion-based parallel generation, using masked diffusion instead of latent-space diffusion to avoid generating word salad.
  • Training problems included a teacher model that was off during distillation and an initial choice of latent diffusion as the target, which led to poor performance; salvage efforts involved repair runs and fine-tuning.
  • Experiments showed small models cannot self-correct effectively; methods like perplexity reranking and confidence-based remasking failed, but an external critic head improved error detection.
  • An inference 'dial' adjusts diffusion steps and candidate answers, trading speed for accuracy, but accuracy plateaued due to limited knowledge from the botched training.
  • Benchmarks against models like SmolLM-135M and GPT-2 showed DIMBA II has lower QA accuracy but advantages in infill tasks and reduced repetition due to its diffusion objective.
  • An experiment tested weight-sharing with LoRA adapters for bidirectionality, reducing parameters by 21% with minimal loss penalty, suggesting efficiency gains.
  • Future plans include a 1.5B-3B parameter run with fixed bugs, optimized training, and seeking funding to study self-correction evolution at larger scales.
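
The inference "dial" described above (diffusion steps and number of candidate answers) can be illustrated with a toy masked-diffusion decoder. This is only a minimal sketch under stated assumptions, not DIMBA II's actual code: `toy_model`, `decode`, and `generate` are hypothetical names, the "model" returns random tokens and confidences, and candidates are ranked by mean confidence as a placeholder for whatever scorer the author used.

```python
import math
import random

MASK = None  # marks a position that has not been filled in yet


def toy_model(tokens, rng):
    """Hypothetical stand-in for a denoiser: one (token, confidence) pair per position."""
    vocab = ["a", "b", "c", "d"]
    return [(rng.choice(vocab), rng.random()) for _ in tokens]


def decode(length, steps, rng):
    """Fill a fully masked sequence in at most `steps` parallel passes."""
    tokens = [MASK] * length
    committed_conf = []
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = toy_model(tokens, rng)
        # Commit an even share of the remaining masked positions this step;
        # on the last step this share equals everything still masked.
        k = math.ceil(len(masked) / (steps - step))
        # Most confident predictions are committed first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
            committed_conf.append(preds[i][1])
    # Placeholder score: mean confidence of the committed tokens.
    score = sum(committed_conf) / max(len(committed_conf), 1)
    return tokens, score


def generate(length, steps, candidates, seed=0):
    """The 'dial': more steps and more candidates cost more model calls."""
    rng = random.Random(seed)
    runs = [decode(length, steps, rng) for _ in range(candidates)]
    return max(runs, key=lambda run: run[1])


if __name__ == "__main__":
    tokens, score = generate(length=8, steps=4, candidates=3)
    print(tokens, round(score, 3))
```

Turning `steps` up commits fewer tokens per pass, and turning `candidates` up samples more complete answers to choose from; both multiply the number of model calls. Neither adds knowledge the model lacks, which is consistent with the accuracy plateau reported above.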