Hasty Briefs

Starting from scratch: Training a 30M Topological Transformer

4 months ago
  • #Machine Learning
  • #Tauformer
  • #Transformer Architecture
  • Tauformer is a topological transformer that replaces dot-product attention with a Laplacian-derived scalar (taumode) computed per token/head.
  • Tauformer ranks keys by similarity of their Laplacian-derived taumode scalars, biasing attention toward domain-relevant relations.
  • Implementation retains Q/K/V projections, RoPE, causal masking, and softmax/value aggregation but changes attention logit computation.
  • The taumode scalar is computed via a bounded Rayleigh-quotient energy, producing λ ∈ [0, 1).
  • KV-cache stores (V, λₖ) instead of (K, V), reducing cache size by ~50%.
  • Training a 30M-parameter TauGPT model with AdamW, base LR 5e-4, and 100-step warmup.
  • Validation loss drops from 4.9255 at step 100 to 1.9146 at step 4500, with a final perplexity of 6.59.
  • Taumode convergence correlates with cross-entropy loss, potentially indicating smoother K representations.
  • Future work includes adaptive taumode strategies and scaling to 100M parameters.
  • Tauformer's deterministic compression may increase learnable structure, aligning with epiplexity principles.
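
The summary describes attention logits derived from taumode similarity rather than Q·K dot products, with the scalar coming from a bounded Rayleigh-quotient energy. The post does not give exact formulas, so the following is a minimal sketch under stated assumptions: the squash `r/(1+r)` to map the Rayleigh quotient into [0, 1), the negative-absolute-difference logit `-scale·|λ_q − λ_k|`, and the names `taumode`/`tau_attention` are all hypothetical choices, not the article's confirmed implementation. RoPE and the Q/K/V projections are omitted for brevity; causal masking and softmax/value aggregation are retained as the summary states.

```python
import numpy as np

def taumode(x, L):
    """Bounded Rayleigh-quotient energy of token vector x under a PSD
    graph Laplacian L, squashed from [0, inf) into [0, 1).
    The squash r/(1+r) is an assumption; the post only states lambda in [0,1)."""
    r = float(x @ L @ x) / (float(x @ x) + 1e-9)  # Rayleigh quotient, >= 0 for PSD L
    return r / (1.0 + r)

def tau_attention(q, k, v, L, scale=8.0):
    """Causal attention whose logits rank keys by taumode similarity
    (hypothetical form: closer taumode scalars get larger logits)."""
    T = q.shape[0]
    lam_q = np.array([taumode(qi, L) for qi in q])
    lam_k = np.array([taumode(ki, L) for ki in k])
    logits = -scale * np.abs(lam_q[:, None] - lam_k[None, :])
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal: no future positions
    logits[mask] = -np.inf
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # standard softmax/value aggregation is retained
```

Note that once λ is computed per key, the full key vectors are no longer needed at decode time, which is what enables the cache change described above.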
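
The ~50% cache reduction from storing (V, λₖ) instead of (K, V) follows from simple accounting: per token and head, a scalar λₖ replaces a full head_dim-sized K vector. A back-of-envelope sketch (the model shapes below are illustrative, not the 30M TauGPT's actual configuration):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, dtype_bytes=2):
    """Compare cache sizes: a standard cache stores K and V per token/head;
    the taumode variant stores V plus a single scalar lambda_k."""
    per_tok_head_std = 2 * head_dim * dtype_bytes     # K + V
    per_tok_head_tau = (head_dim + 1) * dtype_bytes   # V + lambda_k
    std = n_layers * n_heads * seq_len * per_tok_head_std
    tau = n_layers * n_heads * seq_len * per_tok_head_tau
    return std, tau
```

For any reasonable head_dim the ratio is (head_dim + 1) / (2 · head_dim), i.e. just over one half, matching the ~50% figure.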
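
The training setup names AdamW, a base LR of 5e-4, and a 100-step warmup. A minimal sketch of such a schedule, assuming linear warmup; the post does not say what follows warmup, so the constant tail here (rather than, say, cosine decay) is an assumption:

```python
def lr_at(step, base_lr=5e-4, warmup_steps=100):
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant.
    The post-warmup shape is an assumption; the summary only gives the
    base LR and warmup length."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

In a PyTorch training loop this shape is typically wired up via `torch.optim.lr_scheduler.LambdaLR` over an `AdamW` optimizer.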