
Thinking Machines – Modular Manifolds

  • #neural networks
  • #manifold learning
  • #optimization
  • Training large neural networks requires keeping tensors (weights, activations, gradients) healthy to avoid numerical issues.
  • Normalization is key to maintaining tensor health and is commonly applied to activations (e.g., layer norm) and to gradients (e.g., the Muon optimizer); see the layer-norm sketch after this list.
  • Normalizing weight matrices is less common but improves stability and predictability, as seen in models like EDM2; see the weight-normalization sketch below.
  • Manifold constraints on weight matrices offer a structured approach to optimization, ensuring weights stay on beneficial submanifolds.
  • The Stiefel manifold (matrices with orthonormal columns, hence unit condition number) is proposed as the constraint set for weight matrices in neural networks; see the projection sketch below.
  • Manifold optimization proceeds in three steps: find the best direction in the tangent space, take the update step, and retract back onto the manifold (combined in the code sketch below).
  • Modular manifolds extend these ideas to entire networks, budgeting the learning rate across layers according to each layer's Lipschitz sensitivity; see the budgeting sketch below.
  • Future work includes exploring modularity in constraints, improving numerics, and advancing convex optimization techniques for manifolds.
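To make the activation-normalization point concrete, here is a minimal, parameter-free layer norm in NumPy (the learned gain and bias are omitted); `layer_norm` is an illustrative name, not code from the post.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis of x to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# A batch of 4 activation vectors of width 8, deliberately badly scaled.
x = np.random.randn(4, 8) * 10 + 3
y = layer_norm(x)
print(y.mean(axis=-1), y.std(axis=-1))  # means ~0, stds ~1
```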
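Weight normalization in the spirit of EDM2 can be sketched as rescaling each weight vector to unit norm so magnitudes cannot drift during training. Treating rows as the weight vectors is an assumption here; the exact granularity and where the lost scale is reintroduced vary by architecture.

```python
import numpy as np

def normalize_weight(W, eps=1e-8):
    """Rescale each row of a weight matrix to unit L2 norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / (norms + eps)

W = normalize_weight(np.random.randn(16, 64))
print(np.linalg.norm(W, axis=1))  # all ~1
```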
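Constraining a matrix to the Stiefel manifold forces all of its singular values to 1, which is exactly what unit condition number means. A minimal projection takes the polar factor from the SVD, the closest matrix with orthonormal columns; the helper name is illustrative.

```python
import numpy as np

def project_to_stiefel(W):
    """Replace all singular values of W with 1 (closest orthonormal-column matrix)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

Q = project_to_stiefel(np.random.randn(64, 16))
s = np.linalg.svd(Q, compute_uv=False)
print(s.max() / s.min())  # condition number ~1.0
```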
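The three optimization steps combine into one update for the Stiefel case: project the raw gradient onto the tangent space at the current point, step along it, then retract onto the manifold via a polar decomposition. This is a generic Riemannian-gradient-descent sketch under those assumptions, not necessarily the exact update rule from the post.

```python
import numpy as np

def stiefel_step(W, G, lr=0.1):
    """One gradient step constrained to the Stiefel manifold."""
    # 1. Tangent-space projection: remove the component of G that
    #    would violate W.T @ W = I to first order.
    sym = 0.5 * (W.T @ G + G.T @ W)
    tangent = G - W @ sym
    # 2. Update step (leaves the manifold slightly).
    W_new = W - lr * tangent
    # 3. Polar retraction back onto the manifold.
    U, _, Vt = np.linalg.svd(W_new, full_matrices=False)
    return U @ Vt

W, _ = np.linalg.qr(np.random.randn(64, 16))  # start on the manifold
G = np.random.randn(64, 16)                   # stand-in gradient
W = stiefel_step(W, G)
print(np.allclose(W.T @ W, np.eye(16)))       # True: still on the manifold
```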
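Finally, a deliberately simplified picture of learning-rate budgeting: give each layer a share of one global step-size budget that shrinks with its Lipschitz sensitivity, so no single layer dominates how much the network's input-output map can change per step. The inverse-proportional rule below is a hypothetical stand-in for the allocation actually derived in the post.

```python
import numpy as np

def budget_learning_rates(sensitivities, total_lr=1.0):
    """Split a global learning-rate budget inversely to layer sensitivity."""
    inv = 1.0 / np.asarray(sensitivities, dtype=float)
    return total_lr * inv / inv.sum()

# Three layers whose outputs are increasingly sensitive to their weights.
print(budget_learning_rates([1.0, 2.0, 4.0], total_lr=0.3))
```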