
Motivating the logistic and softmax output layers through a great exercise.








The mess: 
\[ 
p(x_1,\dots,x_n) = p(x_1 \mid x_2,\dots,x_n)\,p(x_2,\dots,x_n) = p(x_1 \mid x_2,\dots,x_n)\,p(x_2 \mid x_3,\dots,x_n)\,p(x_3,\dots,x_n) = p(x_n)\prod_{i=1}^{n-1} p(x_i \mid x_{i+1},\dots,x_n)
\]
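
As a quick sanity check on this factorization, here is a small numeric sketch (the joint distribution is random and every name is purely illustrative) verifying the chain rule on three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint distribution p(x1, x2, x3) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginals and conditionals needed by the factorization
# p(x1, x2, x3) = p(x1 | x2, x3) * p(x2 | x3) * p(x3).
p_x3 = joint.sum(axis=(0, 1))           # p(x3)
p_x2x3 = joint.sum(axis=0)              # p(x2, x3)
p_x2_given_x3 = p_x2x3 / p_x3           # p(x2 | x3)
p_x1_given_x2x3 = joint / p_x2x3        # p(x1 | x2, x3)

reconstructed = p_x1_given_x2x3 * p_x2_given_x3 * p_x3
assert np.allclose(reconstructed, joint)  # the chain-rule factorization holds
```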

The basic idea is that Bayes's rule can be rearranged as follows:

\[
P(s=2 \mid x) = \frac{P(x \mid s=2)\,P(s=2)}{\sum_{i\in\{2,3\}} P(x \mid s=i)\,P(s=i)} = \frac{P(x \mid s=2)\,P(s=2)}{P(x \mid s=2)\,P(s=2) + P(x \mid s=3)\,P(s=3)} = \frac{1}{1 + \frac{P(x \mid s=3)\,P(s=3)}{P(x \mid s=2)\,P(s=2)}}
\]
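
To make the last step concrete, here is a tiny sketch with made-up likelihoods and priors (the numbers are purely illustrative) showing that the normalized form and the $1/(1+\text{ratio})$ form give the same posterior:

```python
# Made-up likelihoods P(x | s=i) and priors P(s=i) for the two classes.
lik = {2: 0.30, 3: 0.05}     # P(x | s=2), P(x | s=3)
prior = {2: 0.5, 3: 0.5}     # P(s=2), P(s=3)

# Direct Bayes' rule: normalize over both classes.
posterior_2 = lik[2] * prior[2] / (lik[2] * prior[2] + lik[3] * prior[3])

# Equivalent form obtained by dividing through by the numerator.
ratio = (lik[3] * prior[3]) / (lik[2] * prior[2])
posterior_2_alt = 1.0 / (1.0 + ratio)

assert abs(posterior_2 - posterior_2_alt) < 1e-12
print(posterior_2)   # ~0.857 for these made-up numbers
```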

We can represent $P(x = x_1 \mid s=2)$ as a function of the dot product of two vectors. For example:

Let

\[
x_1 = \begin{bmatrix} 1 & -1 & -1 & -1 & -1 & -1 & -1 \end{bmatrix}^{\top},
\]

which represents the LED display with only the 1st LED on ($+1$ for on, $-1$ for off), and we define the vector

\[
s_2 = \begin{bmatrix} 1 & 1 & -1 & 1 & 1 & -1 & 1 \end{bmatrix}^{\top},
\]

the ideal $\pm 1$ pattern for the digit 2 (one particular segment ordering; the exact ordering doesn't matter for the argument). Then $x_1 \cdot s_2$ can be related to the number of matching LEDs like so:

\[
C = \text{correctCount} = \frac{x_1 \cdot s_2 + 7}{2}
\]
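
A quick numeric check of this identity; the particular pattern used for $s_2$ below is just the illustrative one from above:

```python
import numpy as np

# +1 = LED on, -1 = LED off.
x1 = np.array([1, -1, -1, -1, -1, -1, -1])   # only the 1st LED on
s2 = np.array([1, 1, -1, 1, 1, -1, 1])       # illustrative pattern for the digit 2

matches = np.sum(x1 == s2)       # positions where the two vectors agree
C = (x1 @ s2 + 7) / 2            # the dot-product formula

assert C == matches
print(C)   # 3 matching LEDs for this particular pair
```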

The missing step here is the likelihood model. A natural one for this exercise: assume each LED independently shows the correct state with probability $p$ and the flipped state with probability $1-p$, so that

\[
P(x \mid s=2) = p^{C}(1-p)^{7-C} = (1-p)^{7}\left(\frac{p}{1-p}\right)^{\frac{x \cdot s_2 + 7}{2}},
\]

which is an exponential function of the dot product $x \cdot s_2$.

Thus we see how the probability can be molded into an exponential form, and how Bayes's rule allows for the $\frac{1}{1+e^{f(x)}}$ form. With more than two classes, it's easier to express the probability without dividing the numerator and denominator by the numerator, so we are left with the softmax form as shown in the question. In this light, it is easy to see how the softmax and the logistic output units are essentially the same thing: relative probabilities, expressed as exponentials.
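
The two-class case of this equivalence is easy to check numerically. The sketch below (with arbitrary, purely illustrative scores) shows that a two-class softmax is exactly the logistic function applied to the difference of the scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # shift for numerical stability
    return e / e.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

logits = np.array([1.3, -0.4])   # arbitrary scores for the two classes

p_softmax = softmax(logits)[0]               # P(class 0) via softmax
p_logistic = sigmoid(logits[0] - logits[1])  # P(class 0) via the logistic form

assert np.isclose(p_softmax, p_logistic)
print(p_softmax)   # ~0.8455
```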

A generality

If a random variable has two possible values and the two event probabilities are proportional to $\alpha^{x_1}$ and $\alpha^{x_2}$, then you can arrive at an exponential (logistic) formulation like so:

\[
\frac{\alpha^{x_1}}{\alpha^{x_1} + \alpha^{x_2}} = \frac{1}{1 + \alpha^{(x_2 - x_1)}} = \frac{1}{1 + e^{\beta x}}, \quad \text{where } x = x_2 - x_1 \text{ and } \beta = \ln\alpha.
\]
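
And a quick numeric check of this identity, with arbitrary values for $\alpha$, $x_1$, and $x_2$:

```python
import numpy as np

alpha, x1, x2 = 2.5, 0.7, 1.9    # arbitrary positive base and exponents

lhs = alpha**x1 / (alpha**x1 + alpha**x2)

beta = np.log(alpha)             # beta = ln(alpha)
x = x2 - x1
rhs = 1.0 / (1.0 + np.exp(beta * x))

assert np.isclose(lhs, rhs)
print(lhs)   # ~0.25 for these values
```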