Thinking Machines – Modular Manifolds
- #neural networks
- #manifold learning
- #optimization
- Training large neural networks requires keeping tensors (weights, activations, gradients) healthy to avoid numerical issues.
- Normalization is key to maintaining tensor health; it is commonly applied to activations (e.g., layer norm; see the minimal sketch after this list) and to gradients/updates (e.g., the Muon optimizer).
- Normalizing the weight matrices themselves is less common, but models such as EDM2 show it improves stability and makes training behavior more predictable.
- Manifold constraints on weight matrices offer a structured approach to optimization, ensuring weights stay on beneficial submanifolds.
- The Stiefel manifold, whose matrices have orthonormal columns and therefore unit condition number, is proposed as the constraint set for weight matrices in neural networks.
- Manifold optimization proceeds in three steps: find the best update direction in the tangent space, take the step, and retract back onto the manifold (a generic version is sketched after this list).
- Modular manifolds extend these ideas to whole networks, budgeting learning rates across layers according to each layer's Lipschitz sensitivity (an illustrative allocation follows the list).
- Future work includes exploring modularity in constraints, improving numerics, and advancing convex optimization techniques for manifolds.
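As a concrete illustration of activation normalization, here is a minimal layer norm in NumPy (no learned scale or shift): each activation vector is rescaled to zero mean and unit variance. The function name and `eps` default are illustrative, not taken from the post.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each activation vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```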
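The Stiefel constraint and the three-step manifold update can be sketched generically. The snippet below is a plain retraction-based SGD step on the Stiefel manifold: project the raw gradient onto the tangent space, take a step, then retract via SVD (all singular values set back to one). This is a standard Riemannian-SGD sketch, not the specific optimizer described in the post, and the helper names are assumptions.

```python
import numpy as np

def retract_stiefel(w):
    """Snap a matrix back onto the Stiefel manifold by resetting all
    singular values to one (polar retraction via SVD)."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

def tangent_project(w, g):
    """Project a Euclidean gradient g onto the tangent space at w,
    assuming w has orthonormal columns (w.T @ w == I)."""
    sym = (w.T @ g + g.T @ w) / 2.0
    return g - w @ sym

def manifold_sgd_step(w, g, lr):
    """One constrained step: tangent direction, update, retraction."""
    direction = tangent_project(w, g)
    return retract_stiefel(w - lr * direction)

# Toy usage: the weight stays on the manifold after the step.
rng = np.random.default_rng(0)
w = retract_stiefel(rng.standard_normal((64, 32)))  # start on the manifold
g = rng.standard_normal((64, 32))                   # stand-in gradient
w = manifold_sgd_step(w, g, lr=0.1)
print(np.allclose(w.T @ w, np.eye(32), atol=1e-6))  # columns remain orthonormal
```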
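For the learning-rate budgeting idea, a hedged toy sketch: split a global learning-rate budget across layers in inverse proportion to an estimated Lipschitz sensitivity per layer, so that more sensitive layers take smaller steps. Both the allocation rule and the layer names are illustrative only, not the exact scheme from the post.

```python
def budget_learning_rates(sensitivities, total_lr):
    """Allocate a global learning-rate budget across layers.

    `sensitivities` maps layer name -> estimated Lipschitz sensitivity;
    layers with higher sensitivity receive proportionally smaller rates.
    """
    weights = {name: 1.0 / s for name, s in sensitivities.items()}
    total = sum(weights.values())
    return {name: total_lr * w / total for name, w in weights.items()}

# Hypothetical layers and sensitivities, purely for illustration.
print(budget_learning_rates({"embed": 1.0, "block1": 4.0, "head": 2.0}, total_lr=0.02))
```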