Cross-entropy and KL divergence
Consider that we have an encoding scheme for sending codewords from a set $X$, where codeword $x \in X$ is sent with frequency $p(x)$. Here, $p(x)$ represents the probability of codeword $x$; an optimal encoding assigns codeword $x$ a length of $-\log_2 p(x)$ bits, so the average message length is the entropy

$$H(p) = -\sum_{x} p(x) \log_2 p(x).$$
Now, imagine that the same codewords are sent, but the frequency distribution
of the codewords is changed to $q(x)$.
The length of a codeword is still the same ($-\log_2 p(x)$ bits, since the encoding is still optimized for $p$), so the average message length becomes the cross-entropy

$$H(q, p) = -\sum_{x} q(x) \log_2 p(x).$$
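To make the setup concrete, here is a minimal Python sketch; the distributions p and q below are made up purely for illustration. It computes the average length of messages sent with the p-optimized code, first when codewords follow p and then after the transmission distribution changes to q:

```python
import math

# Hypothetical codeword frequencies, chosen only for illustration.
p = {"a": 0.5, "b": 0.25, "c": 0.25}    # original distribution; the code is optimized for this
q = {"a": 0.125, "b": 0.5, "c": 0.375}  # new transmission distribution

# Entropy H(p): average length of the p-optimized code when codewords follow p.
entropy_p = -sum(p[x] * math.log2(p[x]) for x in p)

# Cross-entropy H(q, p): average length of the same p-optimized code
# once codewords are drawn from q instead.
cross_entropy_qp = -sum(q[x] * math.log2(p[x]) for x in q)

print(f"H(p)    = {entropy_p:.3f} bits")         # 1.500 with these numbers
print(f"H(q, p) = {cross_entropy_qp:.3f} bits")  # 1.875 with these numbers
```

With these example numbers the average message length rises from 1.5 to 1.875 bits per codeword once the codewords start arriving according to q, even though the encoding itself has not changed.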
KL divergence (a pseudo-difference measure between distributions)
Cross-entropy gives us a way to express how different two probability
distributions are; the more different the distributions $q$ and $p$, the larger the cross-entropy $H(q, p)$ is compared to the entropy $H(q)$.

And the same goes for the cross-entropy of $p$ with respect to $q$, $H(p, q) = -\sum_{x} p(x) \log_2 q(x)$: the more different the distributions, the larger it is compared to the entropy $H(p)$.

The differences, $H(q, p) - H(q)$ and $H(p, q) - H(p)$, are the KL divergences $D_{KL}(q \,\|\, p)$ and $D_{KL}(p \,\|\, q)$.
In words: $D_{KL}(p \,\|\, q)$ is the average message length reduction of using a $p$-optimized encoding instead of a $q$-optimized encoding when the codewords are drawn from $p$.
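As a sketch of that statement, again with made-up numbers, and assuming every codeword has nonzero probability under both distributions:

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}    # true transmission distribution
q = {"a": 0.125, "b": 0.5, "c": 0.375}  # distribution the "wrong" code was optimized for

# Average message length under p with the q-optimized code: H(p, q).
length_q_code = -sum(p[x] * math.log2(q[x]) for x in p)
# Average message length under p with the p-optimized code: H(p).
length_p_code = -sum(p[x] * math.log2(p[x]) for x in p)

# The reduction from switching to the p-optimized code is D_KL(p || q).
kl_pq = length_q_code - length_p_code
print(f"D_KL(p || q) = {kl_pq:.3f} bits saved per codeword on average")
```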
Distance-like
The KL divergence is like a distance between two distributions; however, the KL divergence is not symmetric: it is possible for $D_{KL}(p \,\|\, q)$ and $D_{KL}(q \,\|\, p)$ to differ.
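A quick numerical check of this asymmetry, reusing the same made-up distributions; the helper below assumes both distributions assign nonzero probability to every codeword:

```python
import math

def kl(a, b):
    # D_KL(a || b) = sum_x a(x) * log2(a(x) / b(x)); assumes b(x) > 0 wherever a(x) > 0.
    return sum(a[x] * math.log2(a[x] / b[x]) for x in a if a[x] > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.125, "b": 0.5, "c": 0.375}

print(f"D_KL(p || q) = {kl(p, q):.3f} bits")  # ~0.604
print(f"D_KL(q || p) = {kl(q, p):.3f} bits")  # ~0.469, so the two directions differ
```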
The formulation of KL divergence
There are two ways of thinking about the change in average message length: either you keep the encoding and the transmission distribution changes, or the transmission distribution stays the same and you switch encodings. The first situation might suggest a measure of difference like:

(average length of the p-optimized encoding under q) - (average length of the p-optimized encoding under p)
The KL divergence represents the second situation (changing encoding with fixed transmission distribution):
(average length of the q-optimized encoding under p) - (average length of the p-optimized encoding under p)
The KL divergence makes more sense when viewed as a calculation, as the two summations share the same factor $p(x)$ and can be combined into a single sum:

$$D_{KL}(p \,\|\, q) = -\sum_{x} p(x) \log_2 q(x) + \sum_{x} p(x) \log_2 p(x) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}.$$
Compare this to the calculation required for the first formulation:

$$-\sum_{x} q(x) \log_2 p(x) + \sum_{x} p(x) \log_2 p(x).$$
This can't be simplified in the same way as the KL divergence can, because its two sums weight the terms by different distributions ($q$ and $p$).
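A short sketch (same illustrative distributions as above) makes the computational point: the two p-weighted sums of the KL divergence collapse into a single sum of log ratios, while the first formulation's sums are weighted by different distributions and do not combine:

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.125, "b": 0.5, "c": 0.375}

# KL divergence: both sums share the factor p(x), so they merge into one sum.
kl_two_sums = -sum(p[x] * math.log2(q[x]) for x in p) + sum(p[x] * math.log2(p[x]) for x in p)
kl_one_sum = sum(p[x] * math.log2(p[x] / q[x]) for x in p)
assert math.isclose(kl_two_sums, kl_one_sum)

# First formulation: one sum is weighted by q(x), the other by p(x),
# so there is no analogous single-sum form.
distribution_change = (-sum(q[x] * math.log2(p[x]) for x in q)
                       + sum(p[x] * math.log2(p[x]) for x in p))

print(f"KL divergence:               {kl_one_sum:.3f} bits")
print(f"Distribution-change measure: {distribution_change:.3f} bits")
```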