Hasty Briefs (beta)


Softmax, can you derive the Jacobian? And should you care?

4 days ago
  • #probability-distribution
  • #softmax-function
  • #neural-networks
  • Softmax transforms a vector of real numbers into a probability distribution by exponentiating each input and normalizing by the sum of all exponentials, mapping inputs onto a probability simplex.
  • The function exhibits 'winner-takes-most' behavior: it amplifies the largest logit and suppresses smaller ones, making predictions more decisive but potentially overconfident for uncertainty estimation.
  • A naive softmax implementation is numerically unstable because large logits overflow the exponential; the standard fix is subtracting the maximum input value (the max trick), which prevents overflow without changing the output.
  • The Jacobian of softmax reveals coupling between dimensions: increasing one input increases its own output and decreases all others due to the sum-to-one constraint, giving the structure of a diagonal matrix minus a rank-1 (outer-product) correction.
  • Backpropagation through softmax can be computed efficiently without materializing the full Jacobian, using only elementwise operations and a dot product; this reduces memory usage, which is critical for large vocabularies in language models.
  • Softmax is often used with cross-entropy loss, yielding a simplified gradient expression: the difference between predicted probabilities and true labels; in practice, these operations are fused for efficiency.
  • In practice, softmax operates on batches and sequences with an axis parameter to specify normalization direction (e.g., over classes or sequence positions), and a temperature parameter can control distribution sharpness.
  • High temperatures flatten the softmax output towards uniformity, while low temperatures sharpen it towards a one-hot encoding, allowing control over model creativity versus determinism, especially in language models.
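The definition in the first bullet, together with the max-subtraction trick, can be sketched in a few lines of NumPy (a minimal illustration; the function name is ours, not from the post):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability: exp(x - c) / sum(exp(x - c))
    # equals exp(x) / sum(exp(x)) for any constant c, so the output is unchanged
    # but the exponentials can no longer overflow.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs sums to 1 and preserves the ordering of the logits
```

Without the shift, `np.exp(1000.0)` would overflow to `inf`; with it, even extreme logits produce finite probabilities.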
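The "diagonal minus rank-1" Jacobian structure can be written out explicitly and checked against finite differences (a sketch under our own naming, not code from the post):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softmax_jacobian(x):
    # dS_i/dx_j = s_i * (delta_ij - s_j), i.e. diag(s) minus the
    # rank-1 outer product s s^T.
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
# Off-diagonal entries are negative: raising one logit lowers every other
# output, and each column sums to zero because the outputs must sum to one.
```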
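The Jacobian-free backward pass mentioned above reduces to elementwise operations plus a single dot product; one common way to write it (a sketch, not the post's implementation):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def softmax_backward(s, grad_out):
    # Vector-Jacobian product for softmax: s * (g - <g, s>).
    # Equivalent to J.T @ g but never materializes the n x n Jacobian,
    # so memory stays O(n) even for huge vocabularies.
    return s * (grad_out - np.dot(grad_out, s))
```

For an n-way softmax this needs O(n) memory instead of the O(n^2) a full Jacobian would cost.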
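The fused softmax-plus-cross-entropy gradient described above collapses to "predicted probabilities minus true labels", which a finite-difference check confirms (illustrative names, assuming a one-hot target):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def cross_entropy_grad(logits, one_hot_target):
    # Gradient of -sum(y * log(softmax(x))) with respect to the logits x
    # simplifies to probs - y when softmax and cross-entropy are fused.
    return softmax(logits) - one_hot_target
```

Frameworks exploit this simplification by fusing the two operations, which is both faster and more numerically stable than composing them separately.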
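The batched, temperature-controlled variant in the last two bullets can be sketched with an `axis` argument and a logit divisor (parameter names are our assumptions, mirroring common library conventions):

```python
import numpy as np

def softmax(x, axis=-1, temperature=1.0):
    # Divide logits by the temperature, then normalize along `axis` so the
    # function works on batches and sequences, not just single vectors.
    z = x / temperature
    z = z - np.max(z, axis=axis, keepdims=True)  # stability shift per row
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
sharp = softmax(logits, temperature=0.1)   # low T: close to one-hot
flat = softmax(logits, temperature=10.0)   # high T: close to uniform
```

Low temperature sharpens the distribution toward the argmax; high temperature flattens it toward uniform, which is exactly the creativity-versus-determinism knob used when sampling from language models.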