Softmax, can you derive the Jacobian? And should you care?
- #probability-distribution
- #softmax-function
- #neural-networks
- Softmax transforms a vector of real numbers into a probability distribution by exponentiating each input and normalizing by the sum of all exponentials, mapping inputs onto a probability simplex.
- The function exhibits 'winner-takes-most' behavior: it amplifies the largest logit and suppresses smaller ones, making predictions more decisive but potentially overconfident for uncertainty estimation.
- A naive softmax implementation is numerically unstable because the exponentials can overflow; the standard fix is to subtract the maximum input value before exponentiating (the max trick), which leaves the output unchanged (a minimal sketch follows this list).
- The Jacobian of softmax reveals the coupling between dimensions: increasing one input raises its own output and lowers the others because of the sum-to-one constraint, and its structure is a diagonal matrix plus a rank-1 correction, diag(p) − p pᵀ (see the Jacobian sketch below).
- The backward pass through softmax can be computed efficiently without materializing the full Jacobian, using only elementwise operations and a dot product, which saves memory (critical for large vocabularies in language models; see the vector-Jacobian-product sketch below).
- Softmax is usually paired with cross-entropy loss, which simplifies the gradient with respect to the logits to the difference between predicted probabilities and true labels; in practice the two operations are fused for efficiency (see the fused-gradient sketch below).
- In practice, softmax operates on batches and sequences, with an axis parameter selecting which dimension to normalize over (e.g., classes or sequence positions) and a temperature parameter controlling distribution sharpness (see the batched sketch below).
- High temperatures flatten the softmax output towards uniformity, while low temperatures sharpen it towards a one-hot encoding, allowing control over model creativity versus determinism, especially in language models.
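
A few illustrative sketches follow (NumPy; function and variable names are my own, not from the original note). First, the numerically stable softmax using the max-subtraction trick:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D vector of logits."""
    # Subtracting the max leaves the output unchanged, because the shift
    # cancels in the ratio exp(x_i - m) / sum_j exp(x_j - m).
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())  # probabilities on the simplex, summing to 1
```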
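A sketch of the full Jacobian, diag(p) − p pᵀ, built explicitly for a small vector (fine for illustration, far too large to materialize for big vocabularies):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """Full Jacobian dp_i/dx_j = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([2.0, 1.0, 0.1])
J = softmax_jacobian(x)
# Each row sums to zero: raising one logit shifts probability mass between
# outputs, but the outputs still sum to one.
print(J.sum(axis=1))
```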
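A sketch of the memory-efficient backward pass: the vector-Jacobian product collapses to elementwise operations plus a single dot product, checked here against the explicit Jacobian:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_vjp(p, grad_out):
    """Vector-Jacobian product: J^T g = p * (g - <g, p>), no N x N matrix needed."""
    return p * (grad_out - np.dot(grad_out, p))

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
g = np.array([0.5, -0.2, 0.3])        # upstream gradient dL/dp
grad_x = softmax_vjp(p, g)            # dL/dx without building the Jacobian

# Sanity check against the explicit (symmetric) Jacobian.
J = np.diag(p) - np.outer(p, p)
print(np.allclose(grad_x, J.T @ g))   # True
```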
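A sketch of the fused softmax + cross-entropy gradient, where the gradient with respect to the logits collapses to predicted probabilities minus the one-hot labels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([0.0, 1.0, 0.0])   # one-hot true label
p = softmax(x)

loss = -np.sum(y * np.log(p))   # cross-entropy loss
grad_x = p - y                  # simplified gradient w.r.t. the logits
print(loss, grad_x)
```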
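Finally, a sketch of a batched softmax with an axis argument and a temperature parameter (the signature is illustrative; real frameworks expose similar knobs under different names):

```python
import numpy as np

def softmax(x, axis=-1, temperature=1.0):
    """Softmax over the given axis, with an optional temperature."""
    z = x / temperature
    z = z - np.max(z, axis=axis, keepdims=True)   # stability shift per slice
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])               # (batch, classes)
print(softmax(logits, axis=-1))                    # normalize over classes
print(softmax(logits, axis=-1, temperature=5.0))   # flatter, closer to uniform
print(softmax(logits, axis=-1, temperature=0.1))   # sharper, near one-hot
```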