Thinking Machines – Modular Manifolds
- #neural networks
- #manifold learning
- #optimization
- Training large neural networks requires keeping tensors (weights, activations, gradients) healthy to avoid numerical issues.
- Normalization is key to maintaining tensor health; it is commonly applied to activations (e.g., layer norm; see the minimal sketch after this list) and to gradients/updates (e.g., the Muon optimizer).
- Normalizing the weight matrices themselves is less common, but models such as EDM2 show it improves stability and makes training behavior more predictable.
- Manifold constraints on weight matrices offer a structured approach to optimization, ensuring weights stay on beneficial submanifolds.
- The Stiefel manifold, whose matrices have orthonormal columns and therefore unit condition number, is proposed as the constraint set for weight matrices in neural networks.
- Manifold optimization proceeds in three steps: find the best update direction in the tangent space, take the step, and retract back onto the manifold (a generic version is sketched after this list).
- Modular manifolds extend these ideas to whole networks, budgeting learning rates across layers according to each layer's Lipschitz sensitivity (an illustrative allocation follows the list).
- Future work includes exploring modularity in constraints, improving numerics, and advancing convex optimization techniques for manifolds.
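As a concrete illustration of activation normalization, here is a minimal layer norm in NumPy (no learned scale or shift): each activation vector is rescaled to zero mean and unit variance. The function name and `eps` default are illustrative, not taken from the post.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each activation vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```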
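The Stiefel constraint and the three-step manifold update can be sketched generically. The snippet below is a plain retraction-based SGD step on the Stiefel manifold: project the raw gradient onto the tangent space, take a step, then retract via SVD (all singular values set back to one). This is a standard Riemannian-SGD sketch, not the specific optimizer described in the post, and the helper names are assumptions.

```python
import numpy as np

def retract_stiefel(w):
    """Snap a matrix back onto the Stiefel manifold by resetting all
    singular values to one (polar retraction via SVD)."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

def tangent_project(w, g):
    """Project a Euclidean gradient g onto the tangent space at w,
    assuming w has orthonormal columns (w.T @ w == I)."""
    sym = (w.T @ g + g.T @ w) / 2.0
    return g - w @ sym

def manifold_sgd_step(w, g, lr):
    """One constrained step: tangent direction, update, retraction."""
    direction = tangent_project(w, g)
    return retract_stiefel(w - lr * direction)

# Toy usage: the weight stays on the manifold after the step.
rng = np.random.default_rng(0)
w = retract_stiefel(rng.standard_normal((64, 32)))  # start on the manifold
g = rng.standard_normal((64, 32))                   # stand-in gradient
w = manifold_sgd_step(w, g, lr=0.1)
print(np.allclose(w.T @ w, np.eye(32), atol=1e-6))  # columns remain orthonormal
```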
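For the learning-rate budgeting idea, a hedged toy sketch: split a global learning-rate budget across layers in inverse proportion to an estimated Lipschitz sensitivity per layer, so that more sensitive layers take smaller steps. Both the allocation rule and the layer names are illustrative only, not the exact scheme from the post.

```python
def budget_learning_rates(sensitivities, total_lr):
    """Allocate a global learning-rate budget across layers.

    `sensitivities` maps layer name -> estimated Lipschitz sensitivity;
    layers with higher sensitivity receive proportionally smaller rates.
    """
    weights = {name: 1.0 / s for name, s in sensitivities.items()}
    total = sum(weights.values())
    return {name: total_lr * w / total for name, w in weights.items()}

# Hypothetical layers and sensitivities, purely for illustration.
print(budget_learning_rates({"embed": 1.0, "block1": 4.0, "head": 2.0}, total_lr=0.02))
```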