The derivative of the cost function is needed for back-propagation.
When the output layer uses an MLE/cross-entropy cost with a softmax activation, the derivative of the cost function combined with the activation function simplifies to a very compact expression. This note covers that computation.
Consider the last layer of a neural network shown below:
For the given training example \( x_1 \), the correct category is 2 (out of the categories 0, 1 and 2), which corresponds to the output probability \( p_3 \). Only this output is used in the calculation of the loss contributed by the example:

\[ L = -\log(p_3) \]

While only the probability of the correct category appears in the loss, the other outputs \( p_1 \) and \( p_2 \) still affect it, because \( p_3 \) depends on them: the softmax outputs all sum to 1.
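A minimal sketch of this loss in Python with NumPy (the probability vector is an assumed softmax output, taken from the example later in this note):

```python
import numpy as np

# Assumed softmax outputs for the example x_1
p = np.array([0.1, 0.3, 0.6])

# The correct category corresponds to p[2] (p_3 in the note's notation),
# so only that probability enters the cross-entropy loss.
loss = -np.log(p[2])
print(loss)  # about 0.51
```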
Writing \( z_i \) for the pre-softmax outputs (logits) and \( y_i \) for the one-hot target, the partial derivatives will be:

\[ \frac{\partial L}{\partial z_i} = p_i - y_i \]
So if the output probabilities are [0.1, 0.3, 0.6], the derivative of the loss with respect to the last layer's outputs before the softmax activation is [0.1, 0.3, -0.4]: each probability minus the corresponding entry of the one-hot target [0, 0, 1].
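This one-liner gradient can be checked directly (a sketch in NumPy, using the example values above):

```python
import numpy as np

p = np.array([0.1, 0.3, 0.6])   # softmax outputs
y = np.array([0.0, 0.0, 1.0])   # one-hot target: third category is correct

# Gradient of the cross-entropy loss w.r.t. the pre-softmax outputs
grad = p - y
print(grad)  # matches [0.1, 0.3, -0.4] from the text
```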
Formulating the derivative of the loss w.r.t. different variables is simpler if the log and the exponential are canceled at the beginning:

\[ L = -\log\left(\frac{e^{z_3}}{\sum_j e^{z_j}}\right) = -z_3 + \log\sum_j e^{z_j} \]
From here, the partial derivatives are immediate. For example:

\[ \frac{\partial L}{\partial z_1} = \frac{e^{z_1}}{\sum_j e^{z_j}} = p_1, \qquad \frac{\partial L}{\partial z_3} = -1 + \frac{e^{z_3}}{\sum_j e^{z_j}} = p_3 - 1 \]
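The whole derivation can be sanity-checked numerically by comparing the analytic gradient \( p - y \) against central finite differences of the loss (a self-contained sketch; the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def loss(z, k=2):
    # Cross-entropy loss with the correct category at index k (p_3 here)
    return -np.log(softmax(z)[k])

z = np.array([0.2, -1.0, 0.5])  # arbitrary logits for illustration
y = np.array([0.0, 0.0, 1.0])   # one-hot target

# Analytic gradient from the derivation above
analytic = softmax(z) - y

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz) - loss(z - dz)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```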
The below image is part of a separate card, but I'm including it here, as it is quite relevant.