Softmax, can you derive the Jacobian? And should you care?
- #probability-distribution
- #softmax-function
- #neural-networks
- Softmax transforms a vector of real numbers into a probability distribution by exponentiating each input and normalizing by the sum of all exponentials, mapping inputs onto a probability simplex.
- The function exhibits 'winner-takes-most' behavior: it amplifies the largest logit and suppresses smaller ones, making predictions more decisive but potentially overconfident for uncertainty estimation.
- A naive softmax implementation is numerically unstable because the exponentials can overflow; the standard fix is to subtract the maximum input value before exponentiating (the max trick), which leaves the output unchanged (a minimal sketch follows this list).
- The Jacobian of softmax reveals the coupling between dimensions: increasing one input raises its own output and lowers the others because of the sum-to-one constraint, and its structure is a diagonal matrix plus a rank-1 correction, diag(p) − p pᵀ (see the Jacobian sketch below).
- The backward pass through softmax can be computed efficiently without materializing the full Jacobian, using only elementwise operations and a dot product, which saves memory (critical for large vocabularies in language models; see the vector-Jacobian-product sketch below).
- Softmax is usually paired with cross-entropy loss, which simplifies the gradient with respect to the logits to the difference between predicted probabilities and true labels; in practice the two operations are fused for efficiency (see the fused-gradient sketch below).
- In practice, softmax operates on batches and sequences, with an axis parameter selecting which dimension to normalize over (e.g., classes or sequence positions) and a temperature parameter controlling distribution sharpness (see the batched sketch below).
- High temperatures flatten the softmax output towards uniformity, while low temperatures sharpen it towards a one-hot encoding, allowing control over model creativity versus determinism, especially in language models.
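
A few illustrative sketches follow (NumPy; function and variable names are my own, not from the original note). First, the numerically stable softmax using the max-subtraction trick:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D vector of logits."""
    # Subtracting the max leaves the output unchanged, because the shift
    # cancels in the ratio exp(x_i - m) / sum_j exp(x_j - m).
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())  # probabilities on the simplex, summing to 1
```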
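A sketch of the full Jacobian, diag(p) − p pᵀ, built explicitly for a small vector (fine for illustration, far too large to materialize for big vocabularies):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """Full Jacobian dp_i/dx_j = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(x)
# Each row sums to zero: raising one logit shifts probability mass between
# outputs, but the outputs still sum to one.
print(J.sum(axis=1))
```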
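A sketch of the memory-efficient backward pass: the vector-Jacobian product collapses to elementwise operations plus a single dot product, checked here against the explicit Jacobian:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_vjp(p, grad_out):
    """Vector-Jacobian product: J^T g = p * (g - <g, p>), no N x N matrix needed."""
    return p * (grad_out - np.dot(grad_out, p))

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
g = np.array([0.5, -0.2, 0.3])        # upstream gradient dL/dp
grad_x = softmax_vjp(p, g)            # dL/dx without building the Jacobian

# Sanity check against the explicit (symmetric) Jacobian.
J = np.diag(p) - np.outer(p, p)
print(np.allclose(grad_x, J.T @ g))   # True
```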
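A sketch of the fused softmax + cross-entropy gradient, where the gradient with respect to the logits collapses to predicted probabilities minus the one-hot labels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([0.0, 1.0, 0.0])   # one-hot true label
p = softmax(x)

loss = -np.sum(y * np.log(p))   # cross-entropy loss
grad_x = p - y                  # simplified gradient w.r.t. the logits
print(loss, grad_x)
```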
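Finally, a sketch of a batched softmax with an axis argument and a temperature parameter (the signature is illustrative; real frameworks expose similar knobs under different names):

```python
import numpy as np

def softmax(x, axis=-1, temperature=1.0):
    """Softmax over the given axis, with an optional temperature."""
    z = x / temperature
    z = z - np.max(z, axis=axis, keepdims=True)   # stability shift per slice
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])               # (batch, classes)
print(softmax(logits, axis=-1))                    # normalize over classes
print(softmax(logits, axis=-1, temperature=5.0))   # flatter, closer to uniform
print(softmax(logits, axis=-1, temperature=0.1))   # sharper, near one-hot
```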