Six (and a half) intuitions for KL divergence
- #information theory
- #machine learning
- #probability
- KL divergence measures how much more surprised a model expects to be when observing data drawn from the true distribution P, if it instead believes the data comes from distribution Q.
- In hypothesis testing, it quantifies the expected evidence for P over Q per observation when P is true, and it is minimized over Q when Q is the maximum-likelihood fit to data from P.
- In coding, KL divergence is the number of bits wasted per symbol by a code optimized for Q when the data actually follows P; in gambling, it is the expected log-winnings available to a bettor who knows P and exploits odds set by a bookie who believes Q.
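A minimal sketch of the "expected surprise" and "wasted bits" intuitions above, using hypothetical example distributions `p` and `q` of my own choosing: KL(P‖Q) is exactly the cross-entropy of Q under P minus the entropy of P, i.e. the extra bits a believer in Q pays on average.

```python
import math

def entropy(p):
    """Shannon entropy in bits: expected surprise when you know the true distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected surprise (bits) of a believer in q observing data drawn from p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits: the extra surprise / wasted bits from believing q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: true distribution p, mistaken belief q
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

kl = kl_divergence(p, q)
print(kl)  # 0.25 bits of extra surprise per observation

# The coding identity: KL(P||Q) = H(P, Q) - H(P),
# the bits wasted by a code built for q when data follows p.
assert abs(kl - (cross_entropy(p, q) - entropy(p))) < 1e-12
```

Note the asymmetry: `kl_divergence(q, p)` would generally give a different value, which is why KL is a divergence rather than a distance.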