Kullback-Leibler divergence and Gibbs' inequality

2019.05.08

Kullback-Leibler divergence and Gibbs' inequality (answer)

Relative entropy, or Kullback-Leibler divergence, from one distribution \( P(x) \) to another \( Q(x) \), both defined over the same alphabet \( A_x \), is: \[ D_{KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \] The relative entropy satisfies Gibbs' inequality: \[ D_{KL}(P||Q) \ge 0 \] with equality only if \( P = Q \). The relative entropy is not symmetric under interchange of the distributions \( P \) and \( Q \): in general, \( D_{KL}(P||Q) \ne D_{KL}(Q||P) \). So \( D_{KL} \) is not strictly a distance, despite sometimes being called the 'KL distance'. Read more...
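
As a rough numerical illustration of the definition and of Gibbs' inequality, here is a minimal Python sketch. The function name `kl_divergence`, the two example distributions, and the choice of base-2 logarithms are assumptions made for illustration; the excerpt does not fix a base. The sketch also assumes \( Q(x) > 0 \) wherever \( P(x) > 0 \).

```python
import math


def kl_divergence(p, q):
    """D_KL(P||Q) in bits, for distributions given as lists over the same alphabet.

    Assumes q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    # Terms with P(x) = 0 contribute nothing (0 log 0 is taken to be 0).
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


# Two hypothetical distributions over a two-symbol alphabet.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)  # divergence from P to Q
d_qp = kl_divergence(q, p)  # divergence from Q to P
d_pp = kl_divergence(p, p)  # zero, since the distributions are identical

print(d_pq, d_qp, d_pp)
```

Both divergences come out non-negative, they differ from each other, and the divergence of a distribution from itself is zero, which matches the three claims in the excerpt.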

2019.05.08

Entropy of an ensemble

\( \newcommand{\cat}[1] {\mathrm{#1}} \newcommand{\catobj}[1] {\operatorname{Obj}(\mathrm{#1})} \newcommand{\cathom}[1] {\operatorname{Hom}_{\cat{#1}}} \newcommand{\multiBetaReduction}[0] {\twoheadrightarrow_{\beta}} \newcommand{\betaReduction}[0] {\rightarrow_{\beta}} \newcommand{\betaEq}[0] {=_{\beta}} \newcommand{\string}[1] {\texttt{"}\mathtt{#1}\texttt{"}} \newcommand{\symbolq}[1] {\texttt{`}\mathtt{#1}\texttt{'}} \newcommand{\groupMul}[1] { \cdot_{\small{#1}}} \newcommand{\groupAdd}[1] { +_{\small{#1}}} \newcommand{\inv}[1] {#1^{-1} } \newcommand{\bm}[1] { \boldsymbol{#1} } \require{physics} \require{ams} \require{mathtools} \) The entropy of an ensemble, \( X = (x, A_x, P_x) \), is defined to be the average Shannon information content over all outcomes: \[ H(X) = \quad ? \] Read more...

2019.05.07

Entropy of an ensemble (answer)

The entropy of an ensemble, \( X = (x, A_x, P_x) \), is defined to be the average Shannon information content over all outcomes: \[ H(X) = \sum_{x \in A_x} P(x) \log \frac{1}{P(x)} \] Properties of entropy: Read more...
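
To make the definition concrete, the following is a small Python sketch that evaluates \( H(X) \) for a few distributions. The helper name `entropy`, the example probabilities, and the use of base-2 logarithms (so the result is in bits) are illustrative assumptions rather than part of the note.

```python
import math


def entropy(p):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Outcomes with P(x) = 0 are skipped (0 log(1/0) is taken to be 0).
    return sum(px * math.log2(1 / px) for px in p if px > 0)


h_fair = entropy([0.5, 0.5])     # a fair coin
h_biased = entropy([0.9, 0.1])   # a biased coin is less uncertain
h_certain = entropy([1.0, 0.0])  # a certain outcome carries no uncertainty

print(h_fair, h_biased, h_certain)
```

The fair coin gives exactly 1 bit, the certain outcome gives 0, and the biased coin falls in between.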

2019.05.07

Joint entropy of two random variables

2019.05.07

Joint entropy of two random variables (answer)

For two ensembles, \( X = (x, A_x, P_x) \) and \( Y = (y, A_y, P_y) \), where there may be a dependency between \( P_x \) and \( P_y \), the joint entropy of \( X \) and \( Y \) is: \[ H(X, Y) = \sum_{x \in A_x} \sum_{y \in A_y} P(x, y) \log \frac{1}{P(x, y)} \] Entropy is additive for independent random variables: if \( P(x, y) = P(x)P(y) \), then \( H(X, Y) = H(X) + H(Y) \). Proof: \[ \begin{align*} H(X, Y) &= \sum_{x \in A_x} \sum_{y \in A_y} P(x)P(y) \log \frac{1}{P(x)P(y)} \\ &= \sum_{x \in A_x} \sum_{y \in A_y} P(x)P(y) \log \frac{1}{P(x)} + \sum_{x \in A_x} \sum_{y \in A_y} P(x)P(y) \log \frac{1}{P(y)} \\ &= \sum_{x \in A_x} P(x) \log \frac{1}{P(x)} + \sum_{y \in A_y} P(y) \log \frac{1}{P(y)} \\ &= H(X) + H(Y) \end{align*} \] The third line holds because the logarithm in the first sum does not depend on \( y \) and the logarithm in the second does not depend on \( x \), so the remaining factors sum out: \( \sum_{y \in A_y} P(y) = \sum_{x \in A_x} P(x) = 1 \).
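
The additivity result can be checked numerically. The sketch below is illustrative only: the helper names `entropy` and `joint_entropy`, the example marginals, and the base-2 logarithm are assumptions. It builds an independent joint distribution as the product of its marginals, and also evaluates a fully dependent joint distribution for contrast.

```python
import math


def entropy(p):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)


def joint_entropy(joint):
    """Joint entropy in bits; joint[i][j] holds P(x_i, y_j)."""
    return sum(pxy * math.log2(1 / pxy) for row in joint for pxy in row if pxy > 0)


# Hypothetical marginals; the joint is their product, so X and Y are independent.
p_x = [0.25, 0.75]
p_y = [0.5, 0.3, 0.2]
joint_indep = [[a * b for b in p_y] for a in p_x]

h_joint = joint_entropy(joint_indep)
h_sum = entropy(p_x) + entropy(p_y)

# A fully dependent pair (Y always equals X): both marginals are fair coins,
# but the joint entropy is only 1 bit, less than H(X) + H(Y) = 2 bits.
h_dep = joint_entropy([[0.5, 0.0], [0.0, 0.5]])

print(h_joint, h_sum, h_dep)
```

For the independent case the two values agree up to floating-point error; for the dependent case the joint entropy is strictly smaller than the sum of the marginal entropies.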

2019.05.07

Shannon information content

For an ensemble, \( X = (x, A_x, P_x) \), the Shannon information content of an event, \( x \), is defined to be: \[ h(x) = \quad ? \] Read more...

2019.05.07

Shannon information content (answer)

For an ensemble, \( X = (x, A_x, P_x) \), the Shannon information content of an event, \( x \), is defined to be: \[ h(x) = \log_2 \frac{1}{P(x)} \] where \( x \) may be a single outcome or, more generally, an event (a subset of \( A_x \)). Read more...
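
A short Python sketch of this definition follows. The function name `information_content` and the example probabilities are hypothetical, chosen so that the results are whole numbers of bits.

```python
import math


def information_content(p):
    """Shannon information content, in bits, of an event with probability p > 0."""
    return math.log2(1 / p)


h_half = information_content(0.5)      # an event with probability 1/2
h_eighth = information_content(0.125)  # a rarer event carries more information

# For two independent events the probabilities multiply,
# so the information contents add.
h_both = information_content(0.5 * 0.125)

print(h_half, h_eighth, h_both)
```

Rarer events have higher information content, and the content of two independent events occurring together is the sum of their individual contents.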

2019.05.07

What is the derivative of negative log MLE (MLE used as a negative cost) when the variables are passed through softmax activations?

2019.03.19

What is the derivative of negative log MLE (MLE used as a negative cost) when the variables are passed through softmax activations? (answer)

The derivative of the cost function is needed for back-propagation. When the output layer uses a maximum-likelihood (cross-entropy) cost with a softmax activation, the derivative of the cost taken through the activation function simplifies to a very simple expression. This note covers that computation. Consider the last layer of a neural network shown below: Read more...
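
The excerpt stops before stating the simplified expression. The standard result for a softmax output with a one-hot target is that the gradient of the negative log-likelihood with respect to the pre-activation inputs is the softmax output minus the target vector. Assuming that is the expression the note arrives at, the Python sketch below compares it with a finite-difference estimate. All names (`softmax`, `nll`, `analytic_grad`, `numeric_grad`) and the example inputs are hypothetical.

```python
import math


def softmax(z):
    """Softmax of a list of pre-activation inputs (shifted by the max for stability)."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def nll(z, target):
    """Negative log-likelihood of the target class under softmax(z)."""
    return -math.log(softmax(z)[target])


def analytic_grad(z, target):
    """Assumed closed form: softmax output minus the one-hot target."""
    p = softmax(z)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]


def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of the gradient of nll w.r.t. each z_i."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((nll(up, target) - nll(down, target)) / (2 * eps))
    return grad


z = [2.0, -1.0, 0.5]  # hypothetical pre-activation inputs to the last layer
target = 0            # index of the true class

g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(z, target)
print(g_analytic)
print(g_numeric)
```

The two gradients agree to within the finite-difference error, and the analytic gradient sums to zero because both the softmax output and the one-hot target sum to one.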

2019.03.19

Kevin Doran