Starting from scratch: Training a 30M Topological Transformer
- #Machine Learning
- #Tauformer
- #Transformer Architecture
- Tauformer is a topological transformer that replaces dot-product attention with a Laplacian-derived scalar (the "taumode"), computed per token and per head.
- Attention logits rank keys by how closely their taumode scalars match the query's, biasing attention toward domain-relevant relations.
- The implementation retains the Q/K/V projections, RoPE, causal masking, and softmax/value aggregation, but changes how the attention logits are computed (a minimal sketch follows this list).
- The taumode scalar is computed as a bounded Rayleigh-quotient energy, producing λ ∈ [0, 1).
- At decode time, the KV cache stores (V, λₖ) instead of (K, V), reducing cache size by ~50% (see the cache sketch below).
- The 30M-parameter TauGPT model is trained with AdamW, a base LR of 5e-4, and a 100-step warmup (a configuration sketch follows this list).
- Validation loss drops from 4.9255 at step 100 to 1.9146 at step 4500, with a final perplexity of 6.59.
- Taumode convergence correlates with the cross-entropy loss, potentially indicating smoother key (K) representations.
- Future work includes adaptive taumode strategies and scaling to 100M parameters.
- Tauformer's deterministic compression may increase learnable structure, aligning with epiplexity principles.
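
Below is a minimal PyTorch sketch of the attention variant described above, under stated assumptions: the Laplacian is taken to be the path-graph Laplacian over the feature dimension (so xᵀLx is a smoothness energy), the Rayleigh quotient is squashed into [0, 1) via r / (1 + r), and the logit is the negative distance between query and key taumodes scaled by a hypothetical temperature `tau_temp`. None of these specifics are confirmed by the post, and RoPE is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def taumode(x: torch.Tensor) -> torch.Tensor:
    # Rayleigh-quotient energy x^T L x / x^T x, with L assumed to be the
    # path-graph Laplacian over the feature dimension, squashed into [0, 1).
    energy = (x[..., 1:] - x[..., :-1]).pow(2).sum(-1)  # x^T L x (smoothness energy)
    norm = x.pow(2).sum(-1).clamp_min(1e-8)             # x^T x
    r = energy / norm                                   # Rayleigh quotient, r >= 0
    return r / (1.0 + r)                                # lambda in [0, 1)

def tau_attention(q, k, v, tau_temp: float = 0.1):
    # q, k, v: (batch, heads, seq, head_dim). RoPE, retained in the real
    # implementation, is omitted here for brevity.
    lam_q, lam_k = taumode(q), taumode(k)               # (B, H, T) each
    # Logits favor keys whose taumode is close to the query's.
    logits = -(lam_q[..., :, None] - lam_k[..., None, :]).abs() / tau_temp
    T = q.shape[-2]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v                # standard value aggregation
```

Note that each key contributes only its scalar λₖ to the logits, so the full key vector is never needed after λₖ is computed; that is what makes the cache layout below possible.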
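The ~50% cache reduction follows directly: a conventional cache stores K and V per token (2 × head_dim floats), while this layout stores V plus a single scalar λₖ (head_dim + 1 floats). A sketch under the same assumptions as above; the class and method names are hypothetical.

```python
import torch
import torch.nn.functional as F

class TauKVCache:
    # Stores (V, lambda_k) per token: head_dim + 1 floats instead of the
    # 2 * head_dim floats a (K, V) cache needs, i.e. roughly 50% smaller.
    def __init__(self):
        self.v = None       # (B, H, T, D) cached values
        self.lam_k = None   # (B, H, T)    cached key taumodes

    def append(self, v_new, lam_k_new):
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        self.lam_k = (lam_k_new if self.lam_k is None
                      else torch.cat([self.lam_k, lam_k_new], dim=2))

    def attend(self, lam_q, tau_temp: float = 0.1):
        # One decode step: lam_q is (B, H, 1). The whole past is visible
        # during decoding, so no causal mask is needed.
        logits = -(lam_q[..., None] - self.lam_k[..., None, :]).abs() / tau_temp
        return F.softmax(logits, dim=-1) @ self.v       # (B, H, 1, D)
```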
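For the training setup, the post states only AdamW, a base LR of 5e-4, and a 100-step warmup. The linear-warmup-then-cosine-decay schedule and the 4500-step horizon in the sketch below are assumptions, the latter taken from the last reported validation step.

```python
import math
import torch

def make_optimizer(model, base_lr=5e-4, warmup_steps=100, total_steps=4500):
    # AdamW with the stated base LR; weight decay and betas are left at
    # PyTorch defaults since the post does not specify them.
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps            # linear warmup to base LR
        # Cosine decay after warmup is an assumption, not stated in the post.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```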