[Header image: deepdream of a sidewalk]
What is the derivative of the negative log-likelihood cost (the MLE objective used as a cost by negating its log) when the outputs are passed through a softmax activation?

The derivative of the cost function is needed for back-propagation.

When the output layer uses an MLE/cross-entropy cost with a softmax activation, the derivative of the cost with respect to the pre-activation outputs reduces to a very simple expression. This note covers that computation.

Consider the last layer of a neural network shown below:


For the given training example $x_1$, the correct category is 2 (out of the categories 0, 1 and 2, i.e. the third output). Only the output corresponding to category 2 is used to calculate the loss contributed by this example. The loss is given by:

$$\text{Loss}_{ex_1} = -\ln(p_3) = -\ln\left(\frac{e^{f_3}}{e^{f_1}+e^{f_2}+e^{f_3}}\right)$$

While only the probability for the correct category appears in the loss for this example, the other outputs, $p_1$ and $p_2$, still affect it, since $p_3$ depends on them (the probabilities sum to 1).

The partial derivatives will be:

$$\frac{\partial L}{\partial f_1} = p_1 \qquad \frac{\partial L}{\partial f_2} = p_2 \qquad \frac{\partial L}{\partial f_3} = -(1 - p_3) = p_3 - 1$$

So if the output probabilities are [0.1, 0.3, 0.6], the derivative of the loss with respect to the last layer's outputs, before the softmax activation is applied, will be [0.1, 0.3, -0.4].
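To make the example concrete, here is a minimal NumPy sketch (mine, not part of the original card; the logits are made-up values chosen so the softmax comes out to [0.1, 0.3, 0.6]) that computes the loss and the gradient for a correct category of 2:

```python
import numpy as np

def softmax(f):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(f - np.max(f))
    return e / e.sum()

# Hypothetical logits chosen so that softmax(f) = [0.1, 0.3, 0.6].
f = np.log(np.array([0.1, 0.3, 0.6]))
correct = 2                      # index of the correct category

p = softmax(f)
loss = -np.log(p[correct])       # negative log-likelihood / cross-entropy

# Gradient w.r.t. the pre-softmax outputs: p_i everywhere,
# except p_i - 1 at the correct category.
grad = p.copy()
grad[correct] -= 1.0

print(p)     # ≈ [0.1 0.3 0.6]
print(grad)  # ≈ [ 0.1  0.3 -0.4]
```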

Derivations

Formulating the derivative of the loss with respect to the different variables is simpler if the log and the exponential are cancelled at the beginning:

$$\begin{aligned}
\text{Loss}_{ex_1} &= -\ln(p_3) = -\ln\left(\frac{e^{f_3}}{e^{f_1}+e^{f_2}+e^{f_3}}\right) = \ln\left(\frac{e^{f_1}+e^{f_2}+e^{f_3}}{e^{f_3}}\right)\\
&= \ln\left(e^{f_1}+e^{f_2}+e^{f_3}\right) - \ln\left(e^{f_3}\right) = \ln\left(e^{f_1}+e^{f_2}+e^{f_3}\right) - f_3
\end{aligned}$$

From here, the partial derivatives are straightforward. For example:

$$\begin{aligned}
\frac{\partial L}{\partial f_1} &= \frac{\partial}{\partial f_1}\left(\ln\left(e^{f_1}+e^{f_2}+e^{f_3}\right) - f_3\right)\\
&= \frac{\partial}{\partial x}\left(\ln(x)\right)\cdot\frac{\partial}{\partial f_1}\left(e^{f_1}+e^{f_2}+e^{f_3}\right) \qquad \text{where } x = e^{f_1}+e^{f_2}+e^{f_3}\\
&= \frac{1}{x}\,e^{f_1} = \frac{e^{f_1}}{e^{f_1}+e^{f_2}+e^{f_3}} = p_1
\end{aligned}$$
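A quick numerical sanity check of the rewriting above (my own sketch, with arbitrary made-up logits): the two forms of the loss agree.

```python
import numpy as np

f = np.array([2.0, -1.0, 0.5])   # arbitrary example logits (made up)
correct = 2

# Form 1: negative log of the softmax probability of the correct category.
p = np.exp(f) / np.exp(f).sum()
loss_softmax = -np.log(p[correct])

# Form 2: log-sum-exp of all logits minus the logit of the correct category.
loss_lse = np.log(np.exp(f).sum()) - f[correct]

print(np.isclose(loss_softmax, loss_lse))  # True
```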

The image below is part of a separate card, but I'm including it here as it is quite relevant.


Old image:


Some old stuff:

Given the loss (let's just call it $L$, with example 1 being implicit), we wish to find $\frac{\partial L}{\partial f_1}$, $\frac{\partial L}{\partial f_2}$, and $\frac{\partial L}{\partial f_3}$. First consider $f_1$:

$$\begin{aligned}
\frac{\partial L}{\partial f_1} &= \frac{\partial}{\partial f_1}\left(-\ln(p_3)\right)\\
&= -\left(\frac{\partial}{\partial x}\ln(x)\right)\left(\frac{\partial}{\partial y}\left(\frac{e^{f_3}}{y+e^{f_2}+e^{f_3}}\right)\right)\left(\frac{\partial}{\partial f_1}e^{f_1}\right) \qquad \text{where } x = \frac{e^{f_3}}{y+e^{f_2}+e^{f_3}},\ y = e^{f_1}\\
&= -\left(\frac{1}{x}\right)\left(-\frac{e^{f_3}}{\left(y+e^{f_2}+e^{f_3}\right)^2}\right)\left(e^{f_1}\right)\\
&= \left(\frac{e^{f_1}+e^{f_2}+e^{f_3}}{e^{f_3}}\right)\left(\frac{e^{f_3}}{\left(e^{f_1}+e^{f_2}+e^{f_3}\right)^2}\right)\left(e^{f_1}\right)\\
&= \frac{e^{f_1}}{e^{f_1}+e^{f_2}+e^{f_3}} = p_1
\end{aligned}$$

Similarly, $\frac{\partial L}{\partial f_2} = p_2$. $\frac{\partial L}{\partial f_3}$ is calculated as follows:

$$\begin{aligned}
\frac{\partial L}{\partial f_3} &= \frac{\partial}{\partial f_3}\left(-\ln(p_3)\right)\\
&= -\left(\frac{\partial}{\partial x}\ln(x)\right)\left(\frac{\partial}{\partial y}\left(\frac{y}{e^{f_1}+e^{f_2}+y}\right)\right)\left(\frac{\partial}{\partial f_3}e^{f_3}\right) \qquad \text{where } x = \frac{y}{e^{f_1}+e^{f_2}+y},\ y = e^{f_3}\\
&= -\left(\frac{1}{x}\right)\left(\frac{1}{e^{f_1}+e^{f_2}+y} - \frac{e^{f_3}}{\left(e^{f_1}+e^{f_2}+y\right)^2}\right)\left(e^{f_3}\right)\\
&= -\left(\frac{e^{f_1}+e^{f_2}+e^{f_3}}{e^{f_3}}\right)\left(\frac{1}{e^{f_1}+e^{f_2}+e^{f_3}} - \frac{e^{f_3}}{\left(e^{f_1}+e^{f_2}+e^{f_3}\right)^2}\right)\left(e^{f_3}\right)\\
&= -\left(1 - \frac{e^{f_3}}{e^{f_1}+e^{f_2}+e^{f_3}}\right) = p_3 - 1
\end{aligned}$$
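To double-check the final result $[p_1, p_2, p_3 - 1]$, here is a small finite-difference sketch (mine, not from the original note; the logits are arbitrary made-up values) comparing the analytic gradient against numerical derivatives:

```python
import numpy as np

def loss(f, correct=2):
    # Negative log-likelihood of the correct category under a softmax.
    return -np.log(np.exp(f[correct]) / np.exp(f).sum())

f = np.array([1.5, -0.3, 0.7])   # arbitrary example logits (made up)
eps = 1e-6

# Analytic gradient from the derivation: the probabilities, with 1
# subtracted at the correct category.
p = np.exp(f) / np.exp(f).sum()
analytic = p.copy()
analytic[2] -= 1.0

# Central finite differences for each f_i.
numeric = np.zeros_like(f)
for i in range(len(f)):
    up, down = f.copy(), f.copy()
    up[i] += eps
    down[i] -= eps
    numeric[i] = (loss(up) - loss(down)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```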