Starting from scratch: Training a 30M Topological Transformer
- #Machine Learning
- #Tauformer
- #Transformer Architecture
- Tauformer is a topological transformer that replaces dot-product attention with a Laplacian-derived scalar (the "taumode"), computed per token and per head.
- Attention logits rank keys by how closely their taumode scalars match the query's, biasing attention toward domain-relevant relations.
- The implementation retains the Q/K/V projections, RoPE, causal masking, and softmax/value aggregation, but changes how the attention logits are computed (a minimal sketch follows this list).
- The taumode scalar is computed as a bounded Rayleigh-quotient energy, producing λ ∈ [0, 1).
- At decode time, the KV cache stores (V, λₖ) instead of (K, V), reducing cache size by ~50% (see the cache sketch below).
- The 30M-parameter TauGPT model is trained with AdamW, a base LR of 5e-4, and a 100-step warmup (a configuration sketch follows this list).
- Validation loss drops from 4.9255 at step 100 to 1.9146 at step 4500, with a final perplexity of 6.59.
- Taumode convergence correlates with the cross-entropy loss, potentially indicating smoother key (K) representations.
- Future work includes adaptive taumode strategies and scaling to 100M parameters.
- Tauformer's deterministic compression may increase learnable structure, aligning with epiplexity principles.
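
Below is a minimal PyTorch sketch of the attention variant described above, under stated assumptions: the Laplacian is taken to be the path-graph Laplacian over the feature dimension (so xᵀLx is a smoothness energy), the Rayleigh quotient is squashed into [0, 1) via r / (1 + r), and the logit is the negative distance between query and key taumodes scaled by a hypothetical temperature `tau_temp`. None of these specifics are confirmed by the post, and RoPE is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def taumode(x: torch.Tensor) -> torch.Tensor:
    # Rayleigh-quotient energy x^T L x / x^T x, with L assumed to be the
    # path-graph Laplacian over the feature dimension, squashed into [0, 1).
    energy = (x[..., 1:] - x[..., :-1]).pow(2).sum(-1)  # x^T L x (smoothness energy)
    norm = x.pow(2).sum(-1).clamp_min(1e-8)             # x^T x
    r = energy / norm                                   # Rayleigh quotient, r >= 0
    return r / (1.0 + r)                                # lambda in [0, 1)

def tau_attention(q, k, v, tau_temp: float = 0.1):
    # q, k, v: (batch, heads, seq, head_dim). RoPE, retained in the real
    # implementation, is omitted here for brevity.
    lam_q, lam_k = taumode(q), taumode(k)               # (B, H, T) each
    # Logits favor keys whose taumode is close to the query's.
    logits = -(lam_q[..., :, None] - lam_k[..., None, :]).abs() / tau_temp
    T = q.shape[-2]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v                # standard value aggregation
```

Note that each key contributes only its scalar λₖ to the logits, so the full key vector is never needed after λₖ is computed; that is what makes the cache layout below possible.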
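The ~50% cache reduction follows directly: a conventional cache stores K and V per token (2 × head_dim floats), while this layout stores V plus a single scalar λₖ (head_dim + 1 floats). A sketch under the same assumptions as above; the class and method names are hypothetical.

```python
import torch
import torch.nn.functional as F

class TauKVCache:
    # Stores (V, lambda_k) per token: head_dim + 1 floats instead of the
    # 2 * head_dim floats a (K, V) cache needs, i.e. roughly 50% smaller.
    def __init__(self):
        self.v = None       # (B, H, T, D) cached values
        self.lam_k = None   # (B, H, T)    cached key taumodes

    def append(self, v_new, lam_k_new):
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        self.lam_k = (lam_k_new if self.lam_k is None
                      else torch.cat([self.lam_k, lam_k_new], dim=2))

    def attend(self, lam_q, tau_temp: float = 0.1):
        # One decode step: lam_q is (B, H, 1). The whole past is visible
        # during decoding, so no causal mask is needed.
        logits = -(lam_q[..., None] - self.lam_k[..., None, :]).abs() / tau_temp
        return F.softmax(logits, dim=-1) @ self.v       # (B, H, 1, D)
```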
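For the training setup, the post states only AdamW, a base LR of 5e-4, and a 100-step warmup. The linear-warmup-then-cosine-decay schedule and the 4500-step horizon in the sketch below are assumptions, the latter taken from the last reported validation step.

```python
import math
import torch

def make_optimizer(model, base_lr=5e-4, warmup_steps=100, total_steps=4500):
    # AdamW with the stated base LR; weight decay and betas are left at
    # PyTorch defaults since the post does not specify them.
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps            # linear warmup to base LR
        # Cosine decay after warmup is an assumption, not stated in the post.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```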